AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3
- URL: http://arxiv.org/abs/2512.18399v1
- Date: Sat, 20 Dec 2025 15:32:10 GMT
- Title: AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3
- Authors: Mark Kashirskiy, Artiom Lipinski, Ilya Makarov
- Abstract summary: We present AraToken, an Arabic-optimized tokenizer built on the SentencePiece Unigram algorithm. We show that SentencePiece with normalization achieves 18% lower fertility (1.199 vs. 1.35 tokens/word) than unnormalized baselines. Our experiments show that LEP reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and Latin-script languages exhibit suboptimal performance on morphologically rich languages such as Arabic, resulting in inflated token sequences and reduced compression efficiency. In this work, we present AraToken, an Arabic-optimized tokenizer built on the SentencePiece Unigram algorithm with a comprehensive normalization pipeline addressing Arabic-specific orthographic variations, including Alif variants, diacritics, and Arabic-Indic numerals. We systematically compare BPE, WordPiece, and SentencePiece algorithms across multiple configurations, demonstrating that SentencePiece with normalization achieves 18% lower fertility (1.199 vs. 1.35 tokens/word) than unnormalized baselines. Furthermore, we introduce the Language Extension Pipeline (LEP), a method for integrating the optimized tokenizer into Qwen3-0.6B through vocabulary extension with mean subtoken initialization and selective transformer layer unfreezing. Our experiments show that LEP reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. We release our tokenizer, training scripts, and model checkpoints to facilitate Arabic NLP research.
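The abstract names three concrete normalization targets (Alif variants, diacritics, Arabic-Indic numerals) and the fertility metric (tokens per word). A minimal sketch of what such a pipeline and metric might look like is given below; the character classes and rules are illustrative assumptions, not AraToken's actual implementation:

```python
import re
import unicodedata

# Illustrative Arabic normalization pass in the spirit of the abstract;
# AraToken's exact rule set may differ.
ALIF_VARIANTS = re.compile("[\u0622\u0623\u0625]")   # آ / أ / إ -> ا
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")     # tanwin, harakat, dagger Alif
ARABIC_INDIC = {0x0660 + i: str(i) for i in range(10)}  # ٠-٩ -> 0-9

def normalize_arabic(text: str) -> str:
    """Apply a minimal normalization pass before tokenizer training."""
    text = unicodedata.normalize("NFKC", text)       # canonical/compatibility forms
    text = ALIF_VARIANTS.sub("\u0627", text)         # unify Alif variants
    text = DIACRITICS.sub("", text)                  # strip diacritics
    return text.translate(ARABIC_INDIC)              # Arabic-Indic -> ASCII digits

def fertility(tokenize, corpus) -> float:
    """Average tokens per whitespace-delimited word (lower = better compression)."""
    n_tokens = sum(len(tokenize(line)) for line in corpus)
    n_words = sum(len(line.split()) for line in corpus)
    return n_tokens / n_words
```

Under this sketch, a fertility of 1.199 vs. 1.35 means a normalized tokenizer emits roughly 18% fewer subword tokens for the same Arabic text, which shortens sequences during both training and inference.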
Related papers
- Simultaneous Speech-to-Speech Translation Without Aligned Data [52.467808474293605]
Simultaneous speech translation requires translating source speech into a target language in real time. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks.
arXiv Detail & Related papers (2026-02-11T17:41:01Z)
- Comparative Analysis of Tokenization Algorithms for Low-Resource Language Dzongkha [0.1019561860229868]
Most pre-trained tokenizers are suitable for high-resource languages like English but perform poorly for low-resource languages. This study evaluates the training and performance of three common tokenization algorithms in comparison to other popular methods. The results show that while all three algorithms demonstrate potential, SentencePiece is the most effective for Dzongkha tokenization.
arXiv Detail & Related papers (2025-09-18T07:02:55Z)
- Tokens with Meaning: A Hybrid Tokenization Approach for NLP [0.2826977330147589]
Tokenization plays a pivotal role in natural language processing (NLP). We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phonological normalization, root-affix information, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.
arXiv Detail & Related papers (2025-08-19T22:17:42Z)
- Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment [8.097278579432908]
The choice of tokenizer algorithm is the most significant factor influencing performance, with Unigram-based tokenizers consistently outperforming BPE across most settings. While better morphological alignment shows a moderate, positive correlation with performance on text classification and structure prediction tasks, its impact is secondary to the tokenizer algorithm.
arXiv Detail & Related papers (2025-08-11T19:23:59Z)
- Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization [53.22544362024936]
Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives. We introduce Parity-aware Byte-Pair Encoding (BPE). We find empirically that Parity-aware BPE leads to more equitable token counts across languages.
arXiv Detail & Related papers (2025-08-06T18:14:43Z)
- Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%. Right-aligned digit grouping enhances large-number arithmetic by +33%.
arXiv Detail & Related papers (2025-06-23T18:02:26Z)
- MorphTok: Morphologically Grounded Tokenization for Indian Languages [18.594241501479747]
Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs). We propose morphology-aware segmentation as a pre-tokenization step before applying the classical Byte-Pair Encoding (BPE) algorithm. To handle the dependent vowels common in syllable-based writing systems, we propose Constrained BPE (CBPE), which merges dependent vowels into a cohesive unit with other characters instead of leaving them as single units.
arXiv Detail & Related papers (2025-04-14T15:44:45Z)
- Splintering Nonconcatenative Languages for Better Tokenization [4.496923806879088]
We present SPLINTER, a pre-processing step which rearranges text into a linear form. We demonstrate its merit using both intrinsic measures evaluating token vocabularies in Hebrew, Arabic, and Malay.
arXiv Detail & Related papers (2025-03-18T17:11:09Z)
- SuperBPE: Space Travel for Language Models [103.09169510391972]
We introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm. SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks.
arXiv Detail & Related papers (2025-03-17T17:53:23Z)
- MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [81.83460411131931]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose MAGNET, a multilingual adaptive gradient-based subword tokenization method that reduces over-segmentation for these languages.
arXiv Detail & Related papers (2024-07-11T18:59:21Z)
- Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers [48.036317742487796]
We propose a new approach to tokenization for lexical matching retrieval algorithms.
We use the WordPiece tokenizer, which can be built automatically from unsupervised data.
Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
arXiv Detail & Related papers (2022-10-11T14:32:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.