FLEXITOKENS: Flexible Tokenization for Evolving Language Models
- URL: http://arxiv.org/abs/2507.12720v2
- Date: Wed, 23 Jul 2025 18:40:42 GMT
- Title: FLEXITOKENS: Flexible Tokenization for Evolving Language Models
- Authors: Abraham Toluase Owodunni, Orevaoghene Ahia, Sachin Kumar,
- Abstract summary: Language models (LMs) are challenging to adapt to new data distributions by simple finetuning. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. We develop byte-level LMs with learnable tokenizers to make tokenization adaptive.
- Score: 3.2749495104311874
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) are challenging to adapt to new data distributions by simple finetuning. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing over-fragmentation of out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries within the input byte sequence, encoding it into variable-length segments. Existing tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% improvements on downstream task performance compared to subword and other gradient-based tokenizers. Code and data for our experiments will be released at https://github.com/owos/flexitokens
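The abstract's contrast between enforcing a fixed compression rate and allowing a flexible one can be made concrete with a short sketch. The PyTorch code below is illustrative only: the boundary predictor's single linear layer, the sigmoid parameterization, and both loss forms are assumptions, not the paper's exact architecture or training objective.

```python
import torch
import torch.nn as nn


class BoundaryPredictor(nn.Module):
    """Scores each byte position as a potential segment boundary.

    A minimal sketch of the kind of submodule the abstract describes; the
    single linear layer and sigmoid parameterization are assumptions, not
    the paper's architecture.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, byte_hidden: torch.Tensor) -> torch.Tensor:
        # byte_hidden: (batch, seq_len, d_model) -> boundary probabilities (batch, seq_len)
        return torch.sigmoid(self.proj(byte_hidden)).squeeze(-1)


def fixed_rate_loss(boundary_probs: torch.Tensor, target_rate: float) -> torch.Tensor:
    # Equality-style auxiliary loss: pins the expected boundary frequency to
    # `target_rate`, i.e. a fixed compression rate across the corpus
    # (an illustrative form of the rigidity the abstract criticizes).
    return (boundary_probs.mean() - target_rate) ** 2


def flexible_rate_loss(boundary_probs: torch.Tensor, target_rate: float) -> torch.Tensor:
    # One-sided relaxation: only boundary rates above `target_rate`
    # (over-fragmentation) are penalized, leaving the model free to segment
    # more coarsely where the data allows. The exact FLEXITOKENS objective
    # may differ; this is an assumption for illustration.
    return torch.relu(boundary_probs.mean() - target_rate) ** 2
```

In both cases the auxiliary term would be added to the language-modeling loss; the difference illustrated here is whether deviations from the target rate are penalized on both sides or only one.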
Related papers
- Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%. Right-aligned digit grouping enhances large-number arithmetic by +33% (a small grouping sketch appears after this list).
arXiv Detail & Related papers (2025-06-23T18:02:26Z)
- Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP). We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
arXiv Detail & Related papers (2025-06-17T02:37:04Z)
- Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit [45.18582668677648]
We present a training-free method to transplant tokenizers in large language models. We approximate each out-of-vocabulary token as a sparse linear combination of shared tokens (an illustrative OMP sketch appears after this list). We show that OMP achieves the best zero-shot preservation of the base model's performance.
arXiv Detail & Related papers (2025-06-07T00:51:27Z)
- HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization [50.27950279695363]
Many pre-trained language models (PLMs) exhibit suboptimal performance on mid- and low-resource languages. A common strategy to address this is to introduce new tokens specific to the target languages, initialize their embeddings, and apply continual pre-training on target-language data. We propose HYPEROFA, a hypernetwork-based approach for more adaptive token embedding initialization.
arXiv Detail & Related papers (2025-04-21T19:40:32Z)
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating a constraint on every token in the vocabulary can be prohibitively expensive, while locally constrained decoding (LCD) can distort the global distribution over strings by sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines (a bare-bones rejection-sampling sketch appears after this list).
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
- Tokenization is Sensitive to Language Variation [14.568179478275255]
Tokenizers split texts into smaller units and might behave differently for less common linguistic forms. This might affect downstream LLM performance differently on two types of tasks. We find that the best tokenizer varies between the two task types and that the pre-tokenizer has the biggest overall impact on performance.
arXiv Detail & Related papers (2025-02-21T09:58:54Z)
- Retrofitting Large Language Models with Dynamic Tokenization [3.608780819053423]
Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. We propose retrofitting LMs with dynamic tokenization: a way to dynamically decide on token boundaries based on the input text.
arXiv Detail & Related papers (2024-11-27T17:51:58Z)
- Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs). This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts. We introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution (a simplified byte-level sketch appears after this list).
arXiv Detail & Related papers (2024-10-11T23:30:42Z)
- MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [81.83460411131931]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose MAGNET, a multilingual adaptive gradient-based subword tokenization method, to reduce over-segmentation.
arXiv Detail & Related papers (2024-07-11T18:59:21Z)
- Alleviating Over-smoothing for Unsupervised Sentence Representation [96.19497378628594]
We present a Simple method named Self-Contrastive Learning (SSCL) to alleviate the over-smoothing issue.
Our proposed method is quite simple and can be easily extended to various state-of-the-art models to boost performance.
arXiv Detail & Related papers (2023-05-09T11:00:02Z)
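For the "Broken Tokens?" entry, right-aligned digit grouping just means segmenting a numeral into fixed-size chunks counted from the least significant digit, so that group boundaries align with place value. A minimal sketch (the group size of three and the function name are illustrative choices, not taken from the paper):

```python
def group_digits_right_aligned(number: str, group_size: int = 3) -> list[str]:
    """Segment a digit string into chunks of `group_size`, counting from the right,
    so the final (least significant) groups are always full-width."""
    groups = []
    for end in range(len(number), 0, -group_size):
        start = max(0, end - group_size)
        groups.append(number[start:end])
    return list(reversed(groups))


# Right-aligned: ['1', '234', '567'] -- place values line up across numbers.
print(group_digits_right_aligned("1234567"))
# A left-aligned split of the same string would give ['123', '456', '7'].
```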
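For the tokenizer-transplantation entry, the step its summary describes, approximating an out-of-vocabulary token's embedding as a sparse linear combination of shared tokens' embeddings, can be sketched with a plain orthogonal matching pursuit loop. The array shapes, sparsity level, and usage at the end are assumptions for illustration, not the paper's implementation.

```python
import numpy as np


def omp_approximate(target: np.ndarray, shared_embs: np.ndarray, k: int = 8) -> np.ndarray:
    """Approximate `target` (d,) as a k-sparse linear combination of the rows
    of `shared_embs` (n, d) via orthogonal matching pursuit. Returns a length-n
    coefficient vector with at most k nonzeros."""
    n, _ = shared_embs.shape
    coeffs = np.zeros(n)
    residual = target.astype(float)
    support: list[int] = []
    sol = np.zeros(0)
    for _ in range(k):
        # Greedily pick the shared embedding most correlated with the residual.
        scores = shared_embs @ residual
        scores[support] = 0.0  # never reselect an index
        idx = int(np.argmax(np.abs(scores)))
        support.append(idx)
        # Jointly re-fit all selected coefficients by least squares.
        A = shared_embs[support].T  # (d, |support|)
        sol, *_ = np.linalg.lstsq(A, target, rcond=None)
        residual = target - A @ sol
    coeffs[support] = sol
    return coeffs


# Hypothetical usage: coefficients found over tokens shared by two vocabularies
# could then be reused to build the new embedding in the other model's space.
rng = np.random.default_rng(0)
shared = rng.standard_normal((1000, 64))  # shared-token embeddings (assumed shapes)
oov = rng.standard_normal(64)             # an out-of-vocabulary token's embedding
w = omp_approximate(oov, shared, k=8)
print(np.count_nonzero(w))                # at most 8 nonzero coefficients
```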
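For the adaptive weighted rejection sampling entry, the motivation is that checking a constraint on every vocabulary item at every step is expensive; a rejection-style sampler only checks the tokens it actually draws. The sketch below shows the plain rejection idea under that framing; the paper's adaptive weighting, which keeps the procedure efficient and statistically sound, is not reproduced here.

```python
import numpy as np


def rejection_sample_token(probs: np.ndarray, allowed, max_tries: int = 64,
                           rng: np.random.Generator | None = None) -> int:
    """Sample a token index from `probs` subject to a constraint, evaluating the
    (possibly expensive) predicate `allowed(idx)` only on sampled candidates
    rather than on the whole vocabulary."""
    if rng is None:
        rng = np.random.default_rng()
    for _ in range(max_tries):
        idx = int(rng.choice(len(probs), p=probs))
        if allowed(idx):  # one constraint check per draw
            return idx
    # Fallback: the expensive full-vocabulary scan the sampler tries to avoid
    # (assumes at least one token satisfies the constraint).
    mask = np.array([allowed(i) for i in range(len(probs))], dtype=float)
    masked = probs * mask
    return int(rng.choice(len(probs), p=masked / masked.sum()))


# Toy usage with an assumed constraint: only even token ids are allowed.
probs = np.full(10, 0.1)
print(rejection_sample_token(probs, allowed=lambda i: i % 2 == 0))
```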
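For the exact byte-level probabilities entry, the token-to-byte mapping can be previewed in the simplest setting: if the context ends exactly on a token boundary, the probability of the next byte is the total probability of all next tokens whose UTF-8 encoding starts with that byte. The general, unaligned case is what the paper's Byte-Token Representation Lemma handles; the sketch below covers only the aligned case, with toy numbers.

```python
from collections import defaultdict


def next_byte_distribution(token_probs: dict[str, float]) -> dict[int, float]:
    """Collapse a next-token distribution into a next-byte distribution, assuming
    the context ends exactly on a token boundary, so the next byte is simply the
    first byte of whichever token comes next. (The unaligned case, which the
    paper's lemma addresses, is not handled here.)"""
    byte_probs: dict[int, float] = defaultdict(float)
    for token, p in token_probs.items():
        first_byte = token.encode("utf-8")[0]
        byte_probs[first_byte] += p
    return dict(byte_probs)


# Toy next-token distribution (assumed values, not from any real model).
dist = {"the": 0.5, "their": 0.25, "a": 0.25}
print(next_byte_distribution(dist))  # {116: 0.75, 97: 0.25} -- bytes for 't' and 'a'
```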