Lossless Vocabulary Reduction for Auto-Regressive Language Models
- URL: http://arxiv.org/abs/2510.08102v1
- Date: Thu, 09 Oct 2025 11:38:48 GMT
- Title: Lossless Vocabulary Reduction for Auto-Regressive Language Models
- Authors: Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Shin'ya Yamaguchi, Tomoya Ohba, Tamao Sakao, Susumu Takeuchi
- Abstract summary: Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. We establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.
- Score: 21.015330660860865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. In particular, auto-regressive language models generate texts token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has its own vocabulary as a set of possible tokens, they struggle to cooperate with each other at the level of next-token distributions, as in model ensembling. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.
Related papers
- Training Language Models with homotokens Leads to Delayed Overfitting [2.531076482407163]
Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning.
We formalize homotokens as a strictly meaning-preserving form of data augmentation.
In data-constrained pretraining, homotoken augmentation consistently delays overfitting under repeated data exposure.
In multilingual fine-tuning, we find that the effectiveness of homotokens depends on tokenizer quality.
arXiv Detail & Related papers (2026-01-06T09:57:00Z)
- False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models [53.01170039144264]
Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages.
Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages?
We find that models with overlap outperform models with disjoint vocabularies.
arXiv Detail & Related papers (2025-09-23T07:47:54Z)
- A Variational Framework for Improving Naturalness in Generative Spoken Language Models [52.673912922590866]
We propose an end-to-end variational approach that automatically learns to encode continuous speech attributes to enhance semantic tokens.
Our approach eliminates the need for manual extraction and selection of paralinguistic features.
It produces preferred speech continuations according to human raters.
arXiv Detail & Related papers (2025-06-17T17:58:17Z)
- Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP).
We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM.
Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
arXiv Detail & Related papers (2025-06-17T02:37:04Z)
- Tokenization as Finite-State Transduction [24.19959327497118]
We introduce a finite-state framework which can efficiently encode all possible tokenizations of a regular language.
We show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece) fit within this framework.
An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.
arXiv Detail & Related papers (2024-10-21T07:10:07Z)
- Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z)
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- Learning Mutually Informed Representations for Characters and Subwords [26.189422354038978]
We introduce the entanglement model, aiming to combine character and subword language models.
Inspired by vision-language models, our model treats characters and subwords as separate modalities.
We evaluate our model on text classification, named entity recognition, POS-tagging, and character-level sequence labeling.
arXiv Detail & Related papers (2023-11-14T02:09:10Z)
- Learn Your Tokens: Word-Pooled Tokenization for Language Modeling [11.40976202290724]
Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining tokens into longer strings.
Recent attempts to compress and limit context lengths with fixed-size convolutions are helpful but completely ignore the word boundary.
This paper considers an alternative 'learn your tokens' scheme which utilizes the word boundary to pool bytes/characters into word representations.
arXiv Detail & Related papers (2023-10-17T23:34:39Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.