Training Language Models with homotokens Leads to Delayed Overfitting
- URL: http://arxiv.org/abs/2601.02867v2
- Date: Tue, 13 Jan 2026 10:45:36 GMT
- Title: Training Language Models with homotokens Leads to Delayed Overfitting
- Authors: Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu
- Abstract summary: Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning. We formalize homotokens as a strictly meaning-preserving form of data augmentation. In data-constrained pretraining, homotoken augmentation consistently delays overfitting under repeated data exposure. In multilingual fine-tuning, we find that the effectiveness of homotokens depends on tokenizer quality.
- Score: 2.531076482407163
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning, yet induce different internal computations. Despite this non-uniqueness, language models are typically trained using a single canonical longest-prefix tokenization. We formalize homotokens (alternative valid subword segmentations of the same lexical item) as a strictly meaning-preserving form of data augmentation. We introduce a lightweight training architecture that conditions canonical next-token prediction on sampled homotoken variants via an auxiliary causal encoder and block-causal cross-attention, without modifying the training objective or token interface. In data-constrained pretraining, homotoken augmentation consistently delays overfitting under repeated data exposure and improves generalization across diverse evaluation datasets. In multilingual fine-tuning, we find that the effectiveness of homotokens depends on tokenizer quality: gains are strongest when canonical tokens are highly compressed and diminish when the tokenizer already over-fragments the input. Overall, homotokens provide a simple and modular mechanism for inducing tokenization invariance in language models.
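The core idea of the abstract, that one surface string admits many valid subword segmentations, can be sketched with a toy enumerator. This is not the paper's implementation; the vocabulary, function names, and sampling strategy below are illustrative assumptions only.

```python
import random

def segmentations(word, vocab):
    """Enumerate every way to split `word` into pieces drawn from `vocab`."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            for rest in segmentations(word[i:], vocab):
                results.append([piece] + rest)
    return results

def sample_homotoken(word, vocab, canonical, rng=random):
    """Sample a non-canonical segmentation when one exists (toy sketch)."""
    alts = [s for s in segmentations(word, vocab) if s != canonical]
    return rng.choice(alts) if alts else canonical

# Toy vocabulary: a few multi-character merges plus all single characters,
# so the character-level segmentation always exists as a fallback.
vocab = {"un", "believ", "able", "u", "n", "b", "e", "l", "i", "v", "a"}
canonical = ["un", "believ", "able"]
variant = sample_homotoken("unbelievable", vocab, canonical)
assert "".join(variant) == "unbelievable"  # same surface form, different tokens
```

Every sampled variant decodes to the identical string, which is what makes homotokens a strictly meaning-preserving augmentation: only the model-internal token sequence changes.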
Related papers
- LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers [76.59130257385826]
Intermediate merge residues in BPE vocabularies arise frequently during merge learning and are retained in the final vocabulary, but they are mostly merged further and rarely emitted when the tokenizer is actually applied to a corpus. We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens. Experiments show that LiteToken reduces token fragmentation, reduces parameters, and improves robustness to noisy or misspelled inputs, while preserving overall performance.
arXiv Detail & Related papers (2026-02-04T16:19:05Z) - Lossless Vocabulary Reduction for Auto-Regressive Language Models [21.015330660860865]
Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. We establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.
arXiv Detail & Related papers (2025-10-09T11:38:48Z) - Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%. Right-aligned digit grouping enhances large-number arithmetic by +33%.
arXiv Detail & Related papers (2025-06-23T18:02:26Z) - A Variational Framework for Improving Naturalness in Generative Spoken Language Models [52.673912922590866]
We propose an end-to-end variational approach that automatically learns to encode continuous speech attributes to enhance semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. It produces preferred speech continuations according to human raters.
arXiv Detail & Related papers (2025-06-17T17:58:17Z) - Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP). We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
arXiv Detail & Related papers (2025-06-17T02:37:04Z) - Language Models over Canonical Byte-Pair Encodings [56.09166157337198]
We propose methods to enforce canonicality in token-level language models. We show that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
arXiv Detail & Related papers (2025-06-09T17:26:14Z) - Canonical Autoregressive Generation [17.065618029171766]
We show that large language models do not always generate canonical token sequences. We introduce canonical sampling, a simple and efficient sampling method that precludes a given model from generating non-canonical token sequences.
arXiv Detail & Related papers (2025-06-06T18:09:10Z) - Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z) - Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement [5.223020867766102]
We investigate how different tokenization schemes impact number agreement in Spanish plurals.
We find that morphologically-aligned tokenization performs similarly to other tokenization schemes.
Our results indicate that morphological tokenization is not strictly required for performance.
arXiv Detail & Related papers (2024-03-20T17:01:56Z) - Lexinvariant Language Models [84.2829117441298]
Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM).
We study lexinvariant language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice.
We show that a lexinvariant LM can attain perplexity comparable to that of a standard language model, given a sufficiently long context.
arXiv Detail & Related papers (2023-05-24T19:10:46Z) - Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
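Several of the papers above (canonical sampling, enforcing canonicality, broken tokens) revolve around one check: whether a token sequence is the canonical one the tokenizer would produce. A minimal sketch, assuming a toy greedy longest-prefix tokenizer rather than a real merge-order BPE, illustrates the test; the vocabulary and function names are assumptions for illustration.

```python
def longest_prefix_encode(text, vocab):
    """Greedy longest-prefix tokenization (a toy stand-in for canonical encoding)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers position {i}")
    return tokens

def is_canonical(tokens, vocab):
    """A sequence is canonical iff re-encoding its decoded text reproduces it."""
    return longest_prefix_encode("".join(tokens), vocab) == tokens

vocab = {"un", "believ", "able", "u", "n", "b", "e", "l", "i", "v", "a"}
print(is_canonical(["un", "believ", "able"], vocab))       # True
print(is_canonical(["u", "n", "believ", "able"], vocab))   # False: decodes the same, but re-encodes differently
```

Real BPE canonicality depends on the learned merge order rather than longest-prefix matching, but the encode-decode round-trip test has the same shape with a production tokenizer.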
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.