Multi-Token Prediction via Self-Distillation
- URL: http://arxiv.org/abs/2602.06019v1
- Date: Thu, 05 Feb 2026 18:54:48 GMT
- Title: Multi-Token Prediction via Self-Distillation
- Authors: John Kirchenbauer, Abhimanyu Hans, Brian Bartoldson, Micah Goldblum, Ashwinee Panda, Tom Goldstein
- Abstract summary: We consider a new approach for converting a pretrained autoregressive language model from a slow single next token prediction model into a fast standalone multi-token prediction model. On GSM8K, our method produces models that can decode more than $3\times$ faster on average at $<5\%$ drop in accuracy relative to single token decoding performance.
- Score: 73.81494481537636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach for converting a pretrained autoregressive language model from a slow single next token prediction model into a fast standalone multi-token prediction model using a simple online distillation objective. The final model retains the exact same implementation as the pretrained initial checkpoint and is deployable without the addition of any auxiliary verifier or other specialized inference code. On GSM8K, our method produces models that can decode more than $3\times$ faster on average at $<5\%$ drop in accuracy relative to single token decoding performance.
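The abstract describes the conversion only at a high level, so the following is a minimal sketch of what such an online self-distillation objective could look like, not the authors' implementation: the frozen initial checkpoint acts as the slow single-token teacher, and a trainable copy of the same model is taught to emit $k$ tokens from one forward pass by reading logits at appended placeholder positions. The choice of `gpt2` as a stand-in checkpoint, the value of `k`, the placeholder-token trick, and the hard-label cross-entropy loss are all assumptions made for illustration.

```python
# Hypothetical sketch of online self-distillation for multi-token prediction.
# Assumptions (not from the paper): gpt2 as the checkpoint, k = 4 future tokens,
# eos as a placeholder token, hard-label cross-entropy against a greedy teacher rollout.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

k = 4                                                        # tokens predicted per forward pass
name = "gpt2"                                                # stand-in for the pretrained checkpoint
tok = AutoTokenizer.from_pretrained(name)
student = AutoModelForCausalLM.from_pretrained(name)         # trainable copy
teacher = AutoModelForCausalLM.from_pretrained(name).eval()  # frozen single-token teacher
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)
mask_id = tok.eos_token_id                                   # placeholder for the k "future" slots

def distill_step(prefix_ids):
    """One online-distillation step on a single prefix (batch size 1 for clarity)."""
    # Teacher: roll out k next tokens one at a time -- the slow path being distilled.
    with torch.no_grad():
        rollout = teacher.generate(prefix_ids, max_new_tokens=k, do_sample=False)
    targets = rollout[:, prefix_ids.shape[1]:]               # the k teacher tokens

    # Student: one forward pass over [prefix, k placeholders]; positions -(k+1) .. -2
    # of the sequence predict all k placeholder slots simultaneously.
    placeholders = torch.full((1, k), mask_id, dtype=torch.long)
    inp = torch.cat([prefix_ids, placeholders], dim=1)
    logits = student(inp).logits[:, -k - 1:-1, :]

    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

prefix = tok("Natalia sold clips to 48 of her friends.", return_tensors="pt").input_ids
print(distill_step(prefix))
```

An online objective could equally match the teacher's full next-token distributions with a KL term instead of hard labels; the abstract does not specify which form is used, nor how the final model reads out multiple tokens while keeping the original implementation unchanged.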
Related papers
- Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z) - Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production [55.76222360698305]
We explore a class of supervised training objectives that allow a language model to dynamically and autonomously scale the number of compute steps used for each input token. For any token, the model can request additional compute steps by emitting a <don't know> output. We find that the CYB model requests additional steps when doing so improves accuracy, and the model adapts its processing time to token-level complexity and context.
arXiv Detail & Related papers (2025-10-13T21:07:05Z) - Pretraining Language Models to Ponder in Continuous Space [50.52734567589996]
We introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations.
arXiv Detail & Related papers (2025-05-27T03:47:33Z) - Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE [15.003006630308517]
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens. We propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions. Our method significantly boosts prediction accuracy and achieves higher inference speedups.
arXiv Detail & Related papers (2025-02-10T09:24:06Z) - The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation [48.52206677611072]
Speculative decoding aims to speed up autoregressive generation of a language model by verifying in parallel the tokens generated by a smaller draft model.
We show that combinations of simple strategies can achieve significant inference speedups over different tasks (a minimal sketch of this learning-free draft-and-verify pattern appears after this list).
arXiv Detail & Related papers (2024-11-06T09:23:50Z) - Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy. By generalizing the standard next-token factorization, which corresponds to a rank-$1$ canonical probability decomposition, to a rank-$r$ decomposition, we develop an improved model that predicts multiple tokens simultaneously.
arXiv Detail & Related papers (2024-10-23T11:06:36Z) - Efficient Training of Language Models with Compact and Consistent Next Token Distributions [23.312920633391837]
We show that we can train better models faster by pre-aggregating the corpus with a collapsed $n$-gram distribution.
Our approximation facilitates scalability of gains to larger datasets and models.
arXiv Detail & Related papers (2024-07-03T05:40:41Z) - Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour on the learned representations and also the consequences of fixing it by introducing a notion of self-consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z) - Learning to Faithfully Rationalize by Construction [36.572594249534866]
In many settings it is important to be able to understand why a model made a particular prediction.
We propose a simpler variant of this approach that provides faithful explanations by construction.
In both automatic and manual evaluations we find that variants of this simple framework yield explanations superior to those from 'end-to-end' approaches.
arXiv Detail & Related papers (2020-04-30T21:45:40Z) - The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from neural network calculations for simple instances and a later (more accurate) exit for harder ones.
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
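In contrast to the standalone multi-token model above, the N-Grammys entry in this list describes learning-free speculation: draft tokens come from n-gram statistics of the context rather than from a trained draft model, and the target model only verifies them. Below is a toy sketch of that draft-and-verify loop. `next_token` is a hypothetical stand-in for a greedy call into the target model, and in a real system the per-token verification checks would be done in a single batched forward pass rather than sequentially.

```python
# Toy sketch of learning-free batched speculation (in the spirit of the N-Grammys
# summary above). All names here are illustrative, not the paper's API.
from collections import defaultdict

def ngram_draft(context, n=2, length=4):
    """Propose up to `length` tokens by chaining (n-1)-gram statistics of the context."""
    table = defaultdict(list)
    for i in range(len(context) - n + 1):
        table[tuple(context[i:i + n - 1])].append(context[i + n - 1])
    draft, ctx = [], list(context)
    for _ in range(length):
        candidates = table.get(tuple(ctx[-(n - 1):]))
        if not candidates:
            break
        tok = max(set(candidates), key=candidates.count)  # most frequent continuation
        draft.append(tok)
        ctx.append(tok)
    return draft

def speculative_step(context, next_token, n=2, length=4):
    """One decode step: accept the longest draft prefix that greedy decoding agrees with."""
    draft = ngram_draft(context, n, length)
    accepted = []
    for tok in draft:
        # In practice all of these checks come from ONE batched forward pass;
        # they are sequential calls here only for clarity.
        if next_token(context + accepted) != tok:
            break
        accepted.append(tok)
    accepted.append(next_token(context + accepted))  # always emit at least one token
    return accepted

# Toy usage: a "model" that cycles a period-3 pattern, so several draft tokens are accepted.
pattern = [1, 2, 3]
toy_next = lambda seq: pattern[len(seq) % 3]
print(speculative_step([1, 2, 3, 1, 2, 3, 1], toy_next))  # -> [2, 3, 1, 2, 3]
```

Whether drafts come from n-gram lookups, an auxiliary model, or extra prediction heads, the speedup comes from the same place: several tokens are verified per target-model forward pass, which is exactly the pipeline complexity the self-distillation approach above aims to avoid.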
This list is automatically generated from the titles and abstracts of the papers in this site.