Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
- URL: http://arxiv.org/abs/2508.08424v3
- Date: Mon, 10 Nov 2025 14:36:09 GMT
- Title: Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
- Authors: Saketh Reddy Vemula, Sandipan Dandapat, Dipti Misra Sharma, Parameswari Krishnamurthy
- Abstract summary: The choice of tokenizer algorithm is the most significant factor influencing performance, with Unigram-based tokenizers consistently outperforming BPE across most settings. While better morphological alignment shows a moderate, positive correlation with performance on text classification and structure prediction tasks, its impact is secondary to the tokenizer algorithm.
- Score: 8.097278579432908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The relationship between tokenizer algorithm (e.g., Byte-Pair Encoding (BPE), Unigram), morphological alignment, tokenization quality (e.g., compression efficiency), and downstream performance remains largely unclear, particularly for languages with complex morphology. In this paper, we conduct a comprehensive evaluation of tokenizers using small-sized BERT models -- from pre-training through fine-tuning -- for Telugu (agglutinative), along with preliminary evaluation in Hindi (primarily fusional with some agglutination) and English (fusional). To evaluate morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal two key findings for Telugu. First, the choice of tokenizer algorithm is the most significant factor influencing performance, with Unigram-based tokenizers consistently outperforming BPE across most settings. Second, while better morphological alignment shows a moderate, positive correlation with performance on text classification and structure prediction tasks, its impact is secondary to the tokenizer algorithm. Notably, hybrid approaches that use morphological information for pre-segmentation significantly boost the performance of BPE, though not Unigram. Our results further showcase the need for comprehensive intrinsic evaluation metrics for tokenizers that could explain downstream performance trends consistently.
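To make the comparison concrete, here is a minimal sketch of training BPE and Unigram tokenizers side by side with the SentencePiece library, plus a stub for the hybrid setting that pre-segments text at morpheme boundaries before BPE training. The corpus path, vocabulary size, example word, and the presegment() stub are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: train BPE and Unigram tokenizers on the same corpus with
# SentencePiece and compare their segmentations of a Telugu word form.
import sentencepiece as spm

def train_tokenizer(corpus_path: str, model_type: str, prefix: str,
                    vocab_size: int = 16000) -> spm.SentencePieceProcessor:
    """Train a SentencePiece model; model_type is 'bpe' or 'unigram'."""
    spm.SentencePieceTrainer.train(
        input=corpus_path,
        model_prefix=prefix,
        vocab_size=vocab_size,
        model_type=model_type,
        character_coverage=1.0,  # keep all of Telugu's script characters
    )
    return spm.SentencePieceProcessor(model_file=f"{prefix}.model")

def presegment(line: str, analyzer=None) -> str:
    """Hybrid setting (illustrative stub): insert spaces at morpheme
    boundaries from an external analyzer so BPE merges cannot cross
    them; identity when no analyzer is supplied."""
    return line if analyzer is None else " ".join(analyzer(line))

if __name__ == "__main__":
    bpe = train_tokenizer("telugu_corpus.txt", "bpe", "bpe_tok")
    uni = train_tokenizer("telugu_corpus.txt", "unigram", "uni_tok")
    word = "పుస్తకాలలో"  # Telugu: "in the books"
    print("BPE:    ", bpe.encode(word, out_type=str))
    print("Unigram:", uni.encode(word, out_type=str))
```

In this setup only the model_type argument differs between the two tokenizers, which isolates the algorithm choice the paper identifies as the dominant factor.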
Related papers
- Morphologically-Informed Tokenizers for Languages with Non-Concatenative Morphology: A case study of Yoloxóchitl Mixtec ASR [0.0]
This paper investigates the impact of using morphologically-informed tokenizers to aid and streamline the interlinear gloss annotation of an audio corpus of Yoloxóchitl Mixtec (YM) using a combination of ASR and text-based sequence-to-sequence tools. We present two novel tokenization schemes that separate words in a nonlinear manner, preserving information about tonal morphology as much as possible.
arXiv Detail & Related papers (2025-12-05T21:35:42Z)
- Subword Tokenization Strategies for Kurdish Word Embeddings [13.148440909737358]
We develop a BiLSTM-CRF morphological segmenter using bootstrapped training from minimal manual annotation. We evaluate Word2Vec embeddings across comprehensive metrics including similarity preservation, clustering quality, and semantic organization. Morpheme-based tokenization demonstrates superior embedding space organization, better semantic neighborhood structure, and more balanced coverage across morphological complexity levels.
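As a concrete illustration of the segmenter architecture named above, here is a minimal character-level BiLSTM-CRF in PyTorch, assuming the pytorch-crf package; the two-tag boundary scheme and all dimensions are assumptions for illustration, not the paper's implementation.

```python
# Sketch: character-level BiLSTM-CRF segmenter. Tags are illustrative:
# 1 marks a character that begins a new morpheme, 0 a continuation.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRFSegmenter(nn.Module):
    def __init__(self, n_chars: int, n_tags: int = 2,
                 emb_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True,
                            batch_first=True)
        self.proj = nn.Linear(hidden, n_tags)
        self.crf = CRF(n_tags, batch_first=True)

    def _emissions(self, char_ids: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(self.emb(char_ids))
        return self.proj(out)

    def loss(self, char_ids, tags, mask):
        # Negative CRF log-likelihood of the gold boundary tags
        return -self.crf(self._emissions(char_ids), tags, mask=mask)

    def segment(self, char_ids, mask):
        # Viterbi decoding to per-character boundary tags
        return self.crf.decode(self._emissions(char_ids), mask=mask)
```

Decoded boundary tags can then be mapped back to morphemes by splitting each word at the positions tagged 1.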
arXiv Detail & Related papers (2025-11-18T17:33:32Z)
- Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%. Right-aligned digit grouping enhances large-number arithmetic by +33%.
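To make "randomly sampled tokenization" concrete, the sketch below samples uniformly from all segmentations of a string into pieces drawn from a tokenizer's vocabulary; the toy vocabulary and the uniform sampling scheme are assumptions for illustration.

```python
# Sketch: sample a non-canonical tokenization of `text` uniformly from
# all segmentations whose pieces appear in `vocab`.
import random
from functools import lru_cache

def sample_tokenization(text: str, vocab: set, max_piece: int = 16) -> list:
    @lru_cache(maxsize=None)
    def count(i: int) -> int:
        # Number of valid segmentations of text[i:]
        if i == len(text):
            return 1
        return sum(count(j)
                   for j in range(i + 1, min(len(text), i + max_piece) + 1)
                   if text[i:j] in vocab)

    if count(0) == 0:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, i = [], 0
    while i < len(text):
        # Pick the next piece with probability proportional to how many
        # completions it leaves; the overall sample is then uniform.
        options = [j for j in range(i + 1, min(len(text), i + max_piece) + 1)
                   if text[i:j] in vocab]
        j = random.choices(options, weights=[count(j) for j in options])[0]
        pieces.append(text[i:j])
        i = j
    return pieces

vocab = {"t", "o", "k", "e", "n", "i", "z", "a", "to", "ken", "token",
         "iz", "ation"}
print(sample_tokenization("tokenization", vocab))  # e.g. ['token', 'iz', 'ation']
```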
arXiv Detail & Related papers (2025-06-23T18:02:26Z)
- MorphTok: Morphologically Grounded Tokenization for Indian Languages [18.594241501479747]
Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs). We propose morphology-aware segmentation as a pre-tokenization step before applying classical Byte-Pair Encoding. To handle the dependent vowels common in syllable-based writing systems, we propose Constrained BPE (CBPE), which constrains dependent vowels to form a cohesive unit with other characters instead of occurring as standalone tokens.
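The paper's Constrained BPE is not reproduced here, but the constraint it targets can be illustrated with a simple pre-tokenization pass that keeps Unicode combining marks (the categories covering Indic dependent vowel signs) attached to their base characters, so a later BPE stage never strands a vowel sign as its own token; the grouping rule is a deliberate simplification.

```python
# Sketch: group each base character with its trailing combining marks
# (categories Mn/Mc cover Indic dependent vowel signs and viramas) so a
# downstream BPE treats each cluster atomically. A simplified stand-in
# for the constraint CBPE enforces, not the paper's algorithm.
import unicodedata

def grapheme_like_units(word: str) -> list:
    units = []
    for ch in word:
        if units and unicodedata.category(ch) in ("Mn", "Mc"):
            units[-1] += ch  # attach the dependent vowel to its base
        else:
            units.append(ch)
    return units

# Telugu example: vowel signs stay fused with their consonants
print(grapheme_like_units("పుస్తకం"))  # ['పు', 'స్', 'త', 'కం']
```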
arXiv Detail & Related papers (2025-04-14T15:44:45Z)
- From Smør-re-brød to Subwords: Training LLMs on Danish, One Morpheme at a Time [8.28573483085828]
We leverage an annotated Danish morphological dataset to train a semi-supervised model for morphological segmentation. We evaluate four distinct tokenizers, including two custom morphological tokenizers, by analyzing their performance in segmenting Danish words. Our findings reveal that our custom-developed tokenizers substantially enhance morphological segmentation, achieving an F1 score of 58.84, compared to 39.28 achieved by a Danish BPE tokenizer.
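Segmentation F1 scores like those above are commonly computed over morpheme boundary positions; the sketch below shows one standard formulation (boundary precision/recall), which is an assumption about this paper's exact metric.

```python
# Sketch: boundary-level F1 for morphological segmentation. A split like
# ["hus", "e", "ne"] implies internal boundaries at offsets 3 and 4.
def boundaries(segments: list) -> set:
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def boundary_f1(pred: list, gold: list) -> float:
    p, g = boundaries(pred), boundaries(gold)
    if not p and not g:
        return 1.0  # both leave the word unsegmented: perfect agreement
    tp = len(p & g)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Danish "husene" (the houses): prediction merges the last two morphemes
print(boundary_f1(["hus", "ene"], ["hus", "e", "ne"]))  # 0.666...
```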
arXiv Detail & Related papers (2025-04-02T09:26:02Z)
- Tokenization is Sensitive to Language Variation [14.568179478275255]
Tokenizers split texts into smaller units and might behave differently for less common linguistic forms. This might affect downstream LLM performance differently on two types of tasks. We find that the best tokenizer varies on the two task types and that the pre-tokenizer has the biggest overall impact on performance.
arXiv Detail & Related papers (2025-02-21T09:58:54Z)
- Why do language models perform worse for morphologically complex languages? [0.913127392774573]
We find new evidence for a performance gap between agglutinative and fusional languages.
We propose three possible causes for this performance gap: morphological alignment of tokenizers, tokenization quality, and disparities in dataset sizes and measurement.
Results suggest that no language is harder or easier for a language model to learn on the basis of its morphological typology.
arXiv Detail & Related papers (2024-11-21T15:06:51Z)
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis. Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z)
- Unsupervised Morphological Tree Tokenizer [36.584680344291556]
We introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words. Our method is capable of inducing character-level structures that align with morphological rules without annotated training data. Empirical results indicate that the proposed method effectively retains complete morphemes and outperforms widely adopted methods such as BPE and WordPiece.
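"Retaining complete morphemes" can be quantified by checking, against a gold segmentation, whether each morpheme surfaces as exactly one token; the retention rate below is an assumed operationalization for illustration, not necessarily the paper's metric.

```python
# Sketch: fraction of gold morphemes that the tokenizer keeps intact as
# single tokens. Assumes the tokens and the gold morphemes concatenate
# to the same string.
def morpheme_retention(tokens: list, gold_morphemes: list) -> float:
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    kept, pos = 0, 0
    for m in gold_morphemes:
        if (pos, pos + len(m)) in spans:  # morpheme == one whole token
            kept += 1
        pos += len(m)
    return kept / len(gold_morphemes)

# "unhappiness" = un|happi|ness; only "un" survives as a single token
print(morpheme_retention(["un", "happ", "iness"], ["un", "happi", "ness"]))
```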
arXiv Detail & Related papers (2024-06-21T15:35:49Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Categorizing Semantic Representations for Neural Machine Translation [53.88794787958174]
We introduce categorization to the source contextualized representations.
The main idea is to enhance generalization by reducing sparsity and overfitting.
Experiments on a dedicated MT dataset show that our method reduces compositional generalization error rates by 24%.
arXiv Detail & Related papers (2022-10-13T04:07:08Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- Neural Token Segmentation for High Token-Internal Complexity [7.569526565230962]
Tokenizing raw texts into word units is an essential pre-processing step for NLP pipelines.
We propose a novel neural segmentation model which combines contextualised token representation and char-level decoding.
Our model shows substantial improvements in segmentation accuracy on Hebrew and Arabic compared to the state-of-the-art.
arXiv Detail & Related papers (2022-03-21T10:07:17Z)
- BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT-inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts the Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.