Pre-trained Models Perform the Best When Token Distributions Follow Zipf's Law
- URL: http://arxiv.org/abs/2507.22543v1
- Date: Wed, 30 Jul 2025 10:16:23 GMT
- Title: Pre-trained Models Perform the Best When Token Distributions Follow Zipf's Law
- Authors: Yanjin He, Qingkai Zeng, Meng Jiang
- Abstract summary: We propose a method for determining the vocabulary size by analyzing token frequency distributions through Zipf's law. We show that downstream task performance correlates with how closely token distributions follow power-law behavior, and that aligning with Zipfian scaling improves both model efficiency and effectiveness.
- Score: 15.78540876600952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenization is a fundamental step in natural language processing (NLP) and other sequence modeling domains, where the choice of vocabulary size significantly impacts model performance. Despite its importance, selecting an optimal vocabulary size remains underexplored, typically relying on heuristics or dataset-specific choices. In this work, we propose a principled method for determining the vocabulary size by analyzing token frequency distributions through Zipf's law. We show that downstream task performance correlates with how closely token distributions follow power-law behavior, and that aligning with Zipfian scaling improves both model efficiency and effectiveness. Extensive experiments across NLP, genomics, and chemistry demonstrate that models consistently achieve peak performance when the token distribution closely adheres to Zipf's law, establishing Zipfian alignment as a robust and generalizable criterion for vocabulary size selection.
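The following is a minimal sketch of how such a criterion could be applied in practice, assuming Zipfian alignment is scored by the R² of a least-squares line in log-rank vs. log-frequency space; the paper's exact fit statistic may differ, and the function and input names are illustrative.

```python
import numpy as np
from collections import Counter

def zipf_fit_r2(token_counts):
    """R^2 of a straight-line fit in log-rank vs. log-frequency space.
    Values near 1 mean the token distribution is close to Zipf's law."""
    freqs = np.sort(np.asarray(list(token_counts), dtype=float))[::-1]
    freqs = freqs[freqs > 0]
    log_rank = np.log(np.arange(1, len(freqs) + 1))
    log_freq = np.log(freqs)
    slope, intercept = np.polyfit(log_rank, log_freq, 1)
    residuals = log_freq - (slope * log_rank + intercept)
    return 1.0 - residuals.var() / log_freq.var()

def select_vocab_size(tokenized_corpora):
    """tokenized_corpora: {vocab_size: token ids of the corpus tokenized
    with a tokenizer of that vocabulary size}. Returns the size whose
    token distribution is most Zipfian, plus all fit scores."""
    scores = {V: zipf_fit_r2(Counter(ids).values())
              for V, ids in tokenized_corpora.items()}
    return max(scores, key=scores.get), scores
```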
Related papers
- Scaling LLM Pre-training with Vocabulary Curriculum [0.0]
We introduce vocabulary curriculum learning, an approach that improves pretraining efficiency with log-linear scaling gains relative to vocabulary size. Our method alternates between entropy-guided vocabulary expansion and model optimization, enabling models to learn transferable representations across diverse tokenization granularities. Experiments on small-scale GPT models demonstrate improved scaling efficiency, reinforcing the effectiveness of dynamic tokenization.
arXiv Detail & Related papers (2025-02-25T07:18:29Z)
- Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling [10.985444895887207]
We introduce Over-Tokenized Transformers, a framework that decouples input and output vocabularies to improve language modeling performance. We uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design.
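A structural sketch (PyTorch) of decoupling input and output vocabularies as described above; the backbone, the vocabulary sizes, and the omission of causal masking are placeholders, not the paper's configuration.

```python
import torch.nn as nn

class DecoupledVocabLM(nn.Module):
    """Input embeddings drawn from a large (e.g. multi-gram) vocabulary,
    while the output softmax keeps a compact vocabulary. Backbone and
    sizes are placeholders; causal masking is omitted for brevity."""
    def __init__(self, input_vocab=128_000, output_vocab=32_000,
                 d_model=768, n_layers=4, n_heads=12):
        super().__init__()
        self.input_emb = nn.Embedding(input_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.output_head = nn.Linear(d_model, output_vocab)

    def forward(self, input_ids):
        # input_ids: (batch, seq_len) ids from the over-sized input vocabulary
        h = self.backbone(self.input_emb(input_ids))
        return self.output_head(h)  # logits over the smaller output vocabulary
```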
arXiv Detail & Related papers (2025-01-28T14:15:42Z)
- Zipfian Whitening [7.927385005964994]
Most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform.
In reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law.
We show that simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance.
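A minimal numpy sketch of frequency-weighted PCA whitening along these lines, assuming raw word counts serve as the empirical frequencies; the paper's exact estimator may differ.

```python
import numpy as np

def zipfian_whitening(E, counts, eps=1e-8):
    """Center and whiten word embeddings under the empirical (Zipfian)
    word-frequency measure rather than the uniform one.
    E: (V, d) word embeddings; counts: (V,) raw word counts."""
    p = counts / counts.sum()                 # empirical word distribution
    mu = p @ E                                # frequency-weighted mean
    X = E - mu
    cov = (X * p[:, None]).T @ X              # frequency-weighted covariance
    vals, vecs = np.linalg.eigh(cov)
    W = vecs / np.sqrt(vals + eps)            # PCA whitening transform (d, d)
    return X @ W                              # whitened embeddings
```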
arXiv Detail & Related papers (2024-11-01T15:40:19Z)
- ParaICL: Towards Parallel In-Context Learning [74.38022919598443]
Large language models (LLMs) have become the norm in natural language processing. Few-shot in-context learning (ICL) relies on the choice of few-shot demonstration examples. We propose a novel method named parallel in-context learning (ParaICL).
arXiv Detail & Related papers (2024-03-31T05:56:15Z)
- Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
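A generic sketch of an entropy-regularized, token-level policy-gradient loss that illustrates the idea; it is not the exact ETPO objective, and the per-token advantage estimates are assumed to come from elsewhere.

```python
import torch

def token_level_pg_loss(logits, actions, advantages, beta=0.01):
    """Advantage-weighted token log-likelihood plus an entropy bonus.
    logits: (batch, seq_len, vocab); actions: (batch, seq_len) token ids;
    advantages: (batch, seq_len) per-token advantage estimates."""
    logp = torch.log_softmax(logits, dim=-1)
    act_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    entropy = -(logp * logp.exp()).sum(-1)     # per-token policy entropy
    # maximize advantage-weighted log-prob plus entropy => minimize the negative
    return -(advantages.detach() * act_logp + beta * entropy).mean()
```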
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
- A Sparsity-promoting Dictionary Model for Variational Autoencoders [16.61511959679188]
Structuring the latent space in deep generative models is important to yield more expressive models and interpretable representations.
We propose a simple yet effective methodology to structure the latent space via a sparsity-promoting dictionary model.
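A simplified sketch of the general idea, with the latent expressed as a linear combination of learned dictionary atoms and an L1 penalty standing in for the paper's sparsity-promoting prior; names and the penalty form are assumptions.

```python
import torch
import torch.nn as nn

class SparseDictionaryLatent(nn.Module):
    """Latent variable z = codes @ D^T, where D holds learned dictionary
    atoms and the codes are encouraged to be sparse via an L1 penalty
    added to the VAE objective (a stand-in for the paper's prior)."""
    def __init__(self, code_dim, latent_dim, l1_weight=1e-3):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(latent_dim, code_dim) * 0.01)
        self.l1_weight = l1_weight

    def forward(self, codes):
        # codes: (batch, code_dim) sampled from the encoder's posterior
        z = codes @ self.dictionary.t()            # (batch, latent_dim)
        sparsity_penalty = self.l1_weight * codes.abs().mean()
        return z, sparsity_penalty
```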
arXiv Detail & Related papers (2022-03-29T17:13:11Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have been long devised to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
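A minimal sketch of a class-factored output distribution, log p(w|h) = log p(c(w)|h) + log p(w|c(w), h), with a fixed word-to-class map such as hypernym classes; this illustrates the factorization only, not the paper's training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassFactoredHead(nn.Module):
    """Factors the next-word distribution into a class prediction and a
    word prediction restricted to the target word's class."""
    def __init__(self, hidden_dim, vocab_size, n_classes, word2class):
        super().__init__()
        self.class_head = nn.Linear(hidden_dim, n_classes)
        self.word_head = nn.Linear(hidden_dim, vocab_size)
        self.register_buffer("word2class", word2class)  # (vocab_size,) long tensor

    def log_prob(self, h, targets):
        # h: (batch, hidden_dim), targets: (batch,) target word ids
        cls = self.word2class[targets]                                  # (batch,)
        class_logp = F.log_softmax(self.class_head(h), dim=-1)          # (batch, C)
        word_logits = self.word_head(h)                                 # (batch, V)
        same_class = self.word2class.unsqueeze(0) == cls.unsqueeze(1)   # (batch, V)
        word_logits = word_logits.masked_fill(~same_class, float("-inf"))
        word_logp = F.log_softmax(word_logits, dim=-1)
        return (class_logp.gather(1, cls.unsqueeze(1)).squeeze(1)
                + word_logp.gather(1, targets.unsqueeze(1)).squeeze(1))
```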
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
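A rough sketch of a compositional output layer whose word representations are computed from character sequences, so the layer's parameter count does not depend on the training vocabulary; the character encoder here is a stand-in, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CompositionalOutputLayer(nn.Module):
    """Output embeddings built from character sequences, so the layer can
    score any word list without a vocabulary-sized parameter matrix."""
    def __init__(self, n_chars, char_dim=64, hidden_dim=512):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.encoder = nn.GRU(char_dim, hidden_dim, batch_first=True)

    def word_embeddings(self, char_ids):
        # char_ids: (num_words, max_word_len) character indices, 0 = padding
        x = self.char_emb(char_ids)
        _, h = self.encoder(x)           # h: (1, num_words, hidden_dim)
        return h.squeeze(0)              # (num_words, hidden_dim)

    def forward(self, hidden_states, char_ids):
        # hidden_states: (batch, hidden_dim) LM states; score the given word list
        E = self.word_embeddings(char_ids)           # (num_words, hidden_dim)
        return hidden_states @ E.t()                 # logits over the word list
```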
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Slice Sampling for General Completely Random Measures [74.24975039689893]
We present a novel Markov chain Monte Carlo algorithm for posterior inference that adaptively sets the truncation level using auxiliary slice variables.
The efficacy of the proposed algorithm is evaluated on several popular nonparametric models.
arXiv Detail & Related papers (2020-06-24T17:53:53Z)
- Supervised Learning for Non-Sequential Data: A Canonical Polyadic Decomposition Approach [85.12934750565971]
Efficient modelling of feature interactions underpins supervised learning for non-sequential tasks, but explicitly parameterizing every interaction scales combinatorially with the number of features.
To alleviate this issue, it has been proposed to implicitly represent the model parameters as a tensor.
For enhanced expressiveness, we generalize the framework to allow feature mapping to arbitrarily high-dimensional feature vectors.
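A small numpy sketch of scoring with a rank-R CP-decomposed weight tensor over D feature blocks, showing how the factorization keeps parameters linear in the number of features; block and factor names are illustrative.

```python
import numpy as np

def cp_interaction_score(feature_blocks, factors):
    """Score = sum_r prod_d <A_d[:, r], x_d>, i.e. the inner product of a
    rank-R CP-decomposed weight tensor with the outer product of the
    D feature blocks. feature_blocks: list of D vectors x_d;
    factors: list of D matrices A_d with shape (len(x_d), R)."""
    R = factors[0].shape[1]
    projections = np.ones(R)
    for x_d, A_d in zip(feature_blocks, factors):
        projections *= x_d @ A_d        # (R,) per-rank projections
    return projections.sum()
```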
arXiv Detail & Related papers (2020-01-27T22:38:40Z)
- The empirical structure of word frequency distributions [0.0]
I show that first names form natural communicative distributions in most languages.
I then show this pattern of findings replicates in communicative distributions of English nouns and verbs.
arXiv Detail & Related papers (2020-01-09T20:52:38Z)