Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
- URL: http://arxiv.org/abs/2407.13623v3
- Date: Fri, 1 Nov 2024 02:41:36 GMT
- Title: Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
- Authors: Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong
- Abstract summary: Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size.
We propose three complementary approaches for predicting the compute-optimal vocabulary size.
Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes.
- Score: 46.440917272424315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the conclusion that the optimal vocabulary size depends on the compute budget, with larger models requiring larger vocabularies. Most LLMs, however, use insufficient vocabulary sizes. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. By increasing the vocabulary size from the conventional 32K to 43K, we improve performance on ARC-Challenge from 29.1 to 32.0 with the same 2.3e21 FLOPs. Our work highlights the importance of jointly considering tokenization and model scaling for efficient pre-training. The code and demo are available at https://github.com/sail-sg/scaling-with-vocab and https://hf.co/spaces/sail/scaling-with-vocab-demo.
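As a rough illustration of the IsoFLOPs-style analysis mentioned in the abstract, the sketch below picks, for each fixed FLOPs budget, the vocabulary size with the lowest loss, and then fits a power law of that optimum against the budget. The data layout, function names, and power-law form are illustrative assumptions, not the paper's exact procedure; the released code and demo contain the actual fits.

```python
import numpy as np

def optimal_vocab_per_budget(runs: dict[float, list[tuple[int, float]]]) -> dict[float, int]:
    """`runs` maps a FLOPs budget to (vocab_size, loss) pairs measured at that
    budget; return the loss-minimizing vocabulary size for each budget."""
    return {flops: min(points, key=lambda p: p[1])[0] for flops, points in runs.items()}

def fit_power_law(budgets: list[float], optima: list[int]) -> tuple[float, float]:
    """Fit V_opt ≈ a * FLOPs**b in log-log space and return (a, b)."""
    b, log_a = np.polyfit(np.log(budgets), np.log(optima), 1)
    return float(np.exp(log_a)), float(b)
```

Given fitted coefficients (a, b), the compute-optimal vocabulary size for a new budget is read off as a * FLOPs**b.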
Related papers
- Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size [12.916861128475272]
We provide a theoretical analysis of the effect of vocabulary size on training dynamics. We show that as vocabulary size increases, the training dynamics interpolate between the $\mu$P regime and another regime, the large-vocabulary (LV) regime. Our analysis reveals that in the LV regime, the optimal ratio of embedding LR to hidden LR should roughly scale as $\Theta(\sqrt{\text{width}})$.
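As a concrete reading of that last claim, the toy helper below sets the embedding LR from the hidden LR via a $\sqrt{\text{width}}$ ratio; the constant factor c is a placeholder, since only the asymptotic $\Theta(\sqrt{\text{width}})$ scaling is stated.

```python
import math

def embedding_lr(hidden_lr: float, width: int, c: float = 1.0) -> float:
    """Illustrative only: emb_lr = c * sqrt(width) * hidden_lr, so the
    embedding-to-hidden LR ratio scales as sqrt(width); c = 1.0 is a placeholder."""
    return c * math.sqrt(width) * hidden_lr

# e.g. width 4096, hidden LR 3e-4  ->  embedding LR = 64 * 3e-4 = 0.0192
```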
arXiv Detail & Related papers (2025-06-17T23:57:30Z)
- Self-Vocabularizing Training for Neural Machine Translation [15.700883057259931]
We observe that trained translation models are induced to use a byte-pair encoding (BPE) vocabulary subset distinct from the original BPE vocabulary.
We propose self-vocabularizing training, an iterative method that self-selects a smaller, more optimal vocabulary, yielding up to a 1.49 BLEU improvement.
arXiv Detail & Related papers (2025-03-18T02:21:07Z)
- Scaling LLM Pre-training with Vocabulary Curriculum [0.0]
We introduce vocabulary curriculum learning, an approach that improves pretraining efficiency with log-linear scaling gains relative to vocabulary size.
Our method alternates between entropy-guided vocabulary expansion and model optimization, enabling models to learn transferable representations across diverse tokenization granularities.
Experiments on small-scale GPT models demonstrate improved scaling efficiency, reinforcing the effectiveness of dynamic tokenization.
arXiv Detail & Related papers (2025-02-25T07:18:29Z)
- OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models [55.63479003621053]
We introduce OWLS, an open-access suite of multilingual speech recognition and translation models.
We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling.
We show how OWLS can be used to power new research directions by discovering emergent abilities in large-scale speech models.
arXiv Detail & Related papers (2025-02-14T18:51:40Z)
- Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling [10.985444895887207]
We introduce Over-Tokenized Transformers, a framework that decouples input and output vocabularies to improve language modeling performance.
We uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance.
Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design.
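The core idea of decoupled input and output vocabularies can be sketched as follows; the class and shapes below are a minimal illustration, not the paper's architecture (its specific input tokenization scheme is omitted).

```python
import numpy as np

class DecoupledVocabLM:
    """Toy skeleton with independent input and output vocabulary sizes."""

    def __init__(self, input_vocab: int, output_vocab: int, width: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.input_embedding = 0.02 * rng.normal(size=(input_vocab, width))     # V_in x d
        self.output_projection = 0.02 * rng.normal(size=(width, output_vocab))  # d x V_out

    def logits(self, input_ids: np.ndarray) -> np.ndarray:
        hidden = self.input_embedding[input_ids]   # stand-in for the transformer body
        return hidden @ self.output_projection     # scores over the output vocabulary
```

Because the two sizes are independent, the input vocabulary can be grown without enlarging the output softmax.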
arXiv Detail & Related papers (2025-01-28T14:15:42Z)
- Large Vocabulary Size Improves Large Language Models [28.83786065307658]
We investigate the relationship between subword vocabulary size and the performance of large language models (LLMs).
Experimental results show that larger vocabulary sizes lead to better performance in LLMs.
We introduce a simple method to use a new vocabulary instead of the pre-defined one.
arXiv Detail & Related papers (2024-06-24T10:27:07Z)
- The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT), which restricts embedding entries to the language of interest to improve time and memory efficiency.
We apply two language heuristics to trim the full vocabulary, Unicode-based script filtering and corpus-based selection, to different language families and model sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
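As a rough sketch of the Unicode-based script filtering heuristic (simplified, not the paper's exact procedure), one can keep only the vocabulary entries whose letters belong to the target script, plus a few special tokens; the special-token names below are placeholders.

```python
import unicodedata

def trim_vocab_by_script(vocab: dict[str, int], script_prefix: str = "LATIN",
                         keep: tuple[str, ...] = ("<s>", "</s>", "<unk>", "<pad>")) -> dict[str, int]:
    """Keep tokens whose alphabetic characters all have Unicode names starting
    with `script_prefix` (e.g. "LATIN", "CYRILLIC"), plus special tokens."""
    def in_script(token: str) -> bool:
        letters = [ch for ch in token if ch.isalpha()]
        return all(unicodedata.name(ch, "").startswith(script_prefix) for ch in letters)

    return {tok: idx for tok, idx in vocab.items() if tok in keep or in_script(tok)}
```

The surviving rows of the embedding and output matrices would then be gathered and re-indexed to match the trimmed vocabulary.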
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
- Joint Adaptive Representations for Image-Language Learning [59.40890927221377]
We propose a recipe for image-language learning that produces effective models, outperforming bigger and more expensive ones that are often trained on orders-of-magnitude larger datasets.
Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features.
With only 40M training examples and 39 GFLOPs, our lightweight model outperforms state-of-the-art models that use 2-20x more FLOPs and bigger datasets, some with close to 1B training examples.
arXiv Detail & Related papers (2023-05-31T15:02:02Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Fast Vocabulary Projection Method via Clustering for Multilingual Machine Translation on GPU [6.1646755570223934]
This paper proposes a fast vocabulary projection method via clustering.
The proposed method speeds up the vocab projection step itself by up to 2.6x.
We also conduct an extensive human evaluation to verify the proposed method preserves the quality of the translations from the original model.
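The abstract does not spell out the clustering scheme; a generic way to speed up vocabulary projection, sketched below under that assumption, is to cluster the output embeddings offline and score only the tokens in the best-matching clusters at decode time. Function names and defaults here are illustrative, not the paper's method.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_token_clusters(output_emb: np.ndarray, n_clusters: int = 256, seed: int = 0):
    """Offline: cluster the V x d output embedding rows; return the centroids
    and the token ids belonging to each cluster."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(output_emb)
    members = [np.where(km.labels_ == c)[0] for c in range(n_clusters)]
    return km.cluster_centers_, members

def shortlist_logits(hidden: np.ndarray, output_emb: np.ndarray,
                     centroids: np.ndarray, members: list, top_c: int = 4):
    """Decode time: pick the `top_c` clusters closest to the hidden state and
    project onto their tokens only, instead of the full vocabulary."""
    best = np.argsort(centroids @ hidden)[-top_c:]
    token_ids = np.concatenate([members[c] for c in best])
    return token_ids, output_emb[token_ids] @ hidden
```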
arXiv Detail & Related papers (2022-08-14T16:10:14Z)
- Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training [59.571632468137075]
We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity.
We propose an algorithm VoCap to determine the desired vocabulary capacity of each language.
Since a larger vocabulary slows pre-training by making the softmax expensive, we propose k-NN-based target sampling to accelerate it.
arXiv Detail & Related papers (2021-09-15T14:04:16Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
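As a generic illustration of a compositional output layer whose parameter count does not depend on the training vocabulary, a word's output embedding can be built from hashed character n-grams of its surface form; this particular construction is an assumption for illustration, not necessarily the one used in the paper.

```python
import zlib
import numpy as np

def ngram_output_embedding(word: str, table: np.ndarray, n: int = 3) -> np.ndarray:
    """Compose an output embedding from hashed character n-grams; `table` has a
    fixed number of rows, so model size is independent of vocabulary size."""
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    rows = [zlib.crc32(g.encode()) % table.shape[0] for g in grams]
    return table[rows].mean(axis=0)

# Any word, even one unseen in training, gets an embedding at fixed cost.
table = 0.02 * np.random.default_rng(0).normal(size=(2**16, 128))
vec = ngram_output_embedding("vocabulary", table)
```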
arXiv Detail & Related papers (2020-09-24T07:21:14Z)