Distribution-Aware Companding Quantization of Large Language Models
- URL: http://arxiv.org/abs/2603.00364v1
- Date: Fri, 27 Feb 2026 23:00:54 GMT
- Title: Distribution-Aware Companding Quantization of Large Language Models
- Authors: Athul Radhakrishnan, Siddhant Mohan, Mahima Sachdeva,
- Abstract summary: Large language models such as GPT and Llama are trained with a next-token prediction loss. We suggest that training language models to predict multiple future tokens at once results in higher sample efficiency.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B-parameter models solve 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3x faster at inference, even with large batch sizes.
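As a rough illustration of the architecture the abstract describes (not the authors' released code), here is a minimal PyTorch-style sketch of n independent output heads on a shared trunk, trained with an averaged per-head cross-entropy loss; the class, argument names, and target layout are placeholders of mine:

```python
# A minimal sketch (not the paper's implementation): n independent output
# heads on top of a shared trunk, trained with an averaged per-head
# future-token cross-entropy loss. Names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTokenPredictor(nn.Module):
    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.trunk = trunk  # shared transformer trunk: input_ids -> [batch, seq, d_model]
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )  # head k predicts the token at offset k+1

    def forward(self, input_ids: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # targets: [batch, seq, n_future]; targets[..., k] holds the token at
        # position t+k+1, with -100 marking positions that have no such target.
        hidden = self.trunk(input_ids)  # [batch, seq, d_model]
        loss = hidden.new_zeros(())
        for k, head in enumerate(self.heads):
            logits = head(hidden)  # [batch, seq, vocab_size]
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets[..., k].reshape(-1),
                ignore_index=-100,
            )
        return loss / len(self.heads)
```

At inference, the extra heads can either be dropped (falling back to ordinary next-token decoding) or used to propose several tokens per step, which is the kind of mechanism behind the reported inference speedups; that decoding loop is not shown here.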
Related papers
- Multi-Token Prediction via Self-Distillation [73.81494481537636]
We consider a new approach for converting a pretrained autoregressive language model from a slow single next-token prediction model into a fast standalone multi-token prediction model. On GSM8K, our method produces models that can decode more than $3\times$ faster on average, with a $5\%$ drop in accuracy relative to single-token decoding performance.
arXiv Detail & Related papers (2026-02-05T18:54:48Z) - Pretraining Language Models to Ponder in Continuous Space [50.52734567589996]
We introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations.
arXiv Detail & Related papers (2025-05-27T03:47:33Z) - Establishing Task Scaling Laws via Compute-Efficient Model Ladders [136.76316239300363]
We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. We train a set of small-scale "ladder" models, collect data points to fit the parameterized functions of the two prediction steps, and make predictions for two target models. On four multiple-choice tasks formatted as ranked classification, we can predict the accuracy of both target models within 2 points of absolute error.
arXiv Detail & Related papers (2024-12-05T18:21:49Z) - Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy. By generalizing it to a rank-$r$ canonical probability decomposition, we develop an improved model that predicts multiple tokens simultaneously (see the sketch after this list).
arXiv Detail & Related papers (2024-10-23T11:06:36Z) - Better & Faster Large Language Models via Multi-token Prediction [29.067271500844928]
Large language models such as GPT and Llama are trained with a next-token prediction loss.
We suggest that training language models to predict multiple future tokens at once results in higher sample efficiency.
arXiv Detail & Related papers (2024-04-30T17:33:57Z) - Rho-1: Not All Tokens Are What You Need [132.31428897792114]
Previous language model pre-training methods uniformly applied a next-token prediction loss to all training tokens. Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that align with the desired distribution. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% across 9 math tasks.
arXiv Detail & Related papers (2024-04-11T17:52:01Z) - Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
Moreover, scaling laws mostly predict next-token loss, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z) - MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models [40.992566245706996]
We propose the MiLe Loss function to mitigate the bias arising from differing learning difficulties across tokens.
We train generative language models at different scales of 468M, 1.2B, and 6.7B parameters.
Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks.
arXiv Detail & Related papers (2023-10-30T13:33:21Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)