Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential
- URL: http://arxiv.org/abs/2507.11851v1
- Date: Wed, 16 Jul 2025 02:31:40 GMT
- Title: Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential
- Authors: Mohammad Samragh, Arnav Kundu, David Harrison, Kumari Nishu, Devang Naik, Minsik Cho, Mehrdad Farajtabar
- Abstract summary: We propose a novel framework that leverages the inherent knowledge of vanilla autoregressive language models about future tokens. Our method achieves significant speedups through supervised fine-tuning on pretrained models.
- Score: 12.719829360337833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive language models are constrained by their inherently sequential nature, generating one token at a time. This paradigm limits inference speed and parallelism, especially during later stages of generation when the direction and semantics of text are relatively certain. In this work, we propose a novel framework that leverages the inherent knowledge of vanilla autoregressive language models about future tokens, combining techniques to realize this potential and enable simultaneous prediction of multiple subsequent tokens. Our approach introduces several key innovations: (1) a masked-input formulation where multiple future tokens are jointly predicted from a common prefix; (2) a gated LoRA formulation that preserves the original LLM's functionality, while equipping it for multi-token prediction; (3) a lightweight, learnable sampler module that generates coherent sequences from the predicted future tokens; (4) a set of auxiliary training losses, including a consistency loss, to enhance the coherence and accuracy of jointly generated tokens; and (5) a speculative generation strategy that expands tokens quadratically in the future while maintaining high fidelity. Our method achieves significant speedups through supervised fine-tuning on pretrained models. For example, it generates code and math nearly 5x faster, and improves general chat and knowledge tasks by almost 2.5x. These gains come without any loss in quality.
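For intuition, below is a minimal sketch of the masked-input formulation described in the abstract, assuming a Hugging Face-style causal LM that accepts `inputs_embeds`; the `MultiTokenHead` wrapper, the number of mask positions, and the initialization are illustrative, not the authors' exact recipe. Per the abstract, gated LoRA adapters and a learnable sampler (omitted here) preserve the base model's behavior and make the jointly predicted tokens coherent, and the drafts are then verified speculatively.

```python
# Minimal sketch of the masked-input multi-token formulation (illustrative;
# gated LoRA, the sampler module, and the auxiliary losses are omitted).
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    def __init__(self, base_lm: nn.Module, hidden_size: int, k: int = 4):
        super().__init__()
        self.base_lm = base_lm                     # pretrained decoder-only LM
        self.k = k
        # One learnable "mask" embedding per future position t+1 .. t+k.
        self.mask_embeds = nn.Parameter(torch.randn(k, hidden_size) * 0.02)

    def forward(self, prefix_embeds: torch.Tensor) -> torch.Tensor:
        # prefix_embeds: (batch, seq, hidden) embeddings of the shared prefix.
        masks = self.mask_embeds.unsqueeze(0).expand(prefix_embeds.size(0), -1, -1)
        # Append the mask embeddings so k future tokens are predicted jointly
        # from the common prefix in a single forward pass.
        extended = torch.cat([prefix_embeds, masks], dim=1)
        logits = self.base_lm(inputs_embeds=extended).logits
        return logits[:, -self.k:, :]              # (batch, k, vocab) for t+1..t+k
```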
Related papers
- Improving Large Language Models with Concept-Aware Fine-Tuning [55.59287380665864]
Concept-Aware Fine-Tuning (CAFT) is a novel multi-token training method for large language models (LLMs). CAFT enables the learning of sequences that span multiple tokens, fostering stronger concept-aware learning. Experiments demonstrate significant improvements compared to conventional next-token fine-tuning methods.
arXiv Detail & Related papers (2025-06-09T14:55:00Z) - Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation [63.89280381800457]
We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism. Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
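As a rough illustration of the dimension-wise quantization idea (the bin count, the uniform binning range, and the function names are assumptions, not TokenBridge's actual configuration):

```python
# Hedged sketch: each feature dimension of a continuous token is discretized
# independently, so a d-dimensional latent becomes d small categorical symbols.
import torch

def dimensionwise_quantize(z: torch.Tensor, num_bins: int = 64,
                           lo: float = -1.0, hi: float = 1.0) -> torch.Tensor:
    """z: (batch, dim) continuous latents -> integer bin indices, same shape."""
    z = z.clamp(lo, hi)
    return ((z - lo) / (hi - lo) * (num_bins - 1)).round().long()

def dequantize(idx: torch.Tensor, num_bins: int = 64,
               lo: float = -1.0, hi: float = 1.0) -> torch.Tensor:
    """Map bin indices back to approximate continuous values."""
    return lo + idx.float() / (num_bins - 1) * (hi - lo)
```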
arXiv Detail & Related papers (2025-03-20T17:59:59Z) - Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding [11.07450742824775]
Speculative decoding aims to accelerate the auto-regressive token generation process of a target Large Language Model. Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles a token in the sequence. We propose Gumiho, a hybrid model combining serial and parallel heads.
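A hedged sketch of such a serial-plus-parallel head layout (a GRU cell stands in for the heavier serial module, and the 2-serial / 3-parallel split is arbitrary; this is not Gumiho's exact architecture):

```python
# Illustrative hybrid draft head: early draft tokens are produced one after
# another by a heavier serial module, later ones all at once by cheap heads.
import torch
import torch.nn as nn

class HybridDraftHeads(nn.Module):
    def __init__(self, hidden: int, vocab: int, n_serial: int = 2, n_parallel: int = 3):
        super().__init__()
        self.serial = nn.GRUCell(hidden, hidden)    # heavier, sequential refinement
        self.parallel = nn.ModuleList(nn.Linear(hidden, vocab) for _ in range(n_parallel))
        self.lm_head = nn.Linear(hidden, vocab)
        self.n_serial = n_serial

    def forward(self, h_last: torch.Tensor, tok_embed: nn.Embedding) -> torch.Tensor:
        logits, h = [], h_last                      # h_last: (batch, hidden)
        # Early draft tokens: predicted serially for higher accuracy.
        for _ in range(self.n_serial):
            step_logits = self.lm_head(h)
            logits.append(step_logits)
            h = self.serial(tok_embed(step_logits.argmax(-1)), h)
        # Later draft tokens: predicted in parallel from the same state.
        logits += [head(h) for head in self.parallel]
        return torch.stack(logits, dim=1)           # (batch, n_serial + n_parallel, vocab)
```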
arXiv Detail & Related papers (2025-03-13T07:55:38Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing speedup ratios of 1.9x-3x across several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy. Building on recent multi-head approaches that predict subsequent tokens independently (a rank-1 canonical decomposition), we generalize to a rank-$r$ canonical probability decomposition and develop an improved model that predicts multiple tokens simultaneously.
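For intuition, such a factorization can be written as a rank-$r$ mixture (canonical/CP decomposition) of per-position categoricals, with $r=1$ recovering independent per-head prediction; the notation below ($h_t$ for the hidden state, $w_c$ for mixture weights) is illustrative:

```latex
% Rank-r canonical (CP) decomposition of the joint next-k-token distribution.
P(x_{t+1},\dots,x_{t+k}\mid x_{\le t})
  \approx \sum_{c=1}^{r} w_c(h_t)\,\prod_{j=1}^{k} p_{c,j}\!\left(x_{t+j}\mid h_t\right),
\qquad \sum_{c=1}^{r} w_c(h_t)=1 .
```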
arXiv Detail & Related papers (2024-10-23T11:06:36Z) - Emu3: Next-Token Prediction is All You Need [45.142268281651035]
We introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction.
Emu3 outperforms several well-established task-specific models in both generation and perception tasks.
It is also capable of generating high-fidelity video via predicting the next token in a video sequence.
arXiv Detail & Related papers (2024-09-27T16:06:11Z) - Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
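A minimal sketch of mask-predict style iterative refinement under these assumptions (`fill_probs` is a stand-in for any GMLM returning per-position token probabilities; the re-masking schedule is illustrative, not the authors' T5 recipe):

```python
# Illustrative iteratively-refined parallel decoding: predict all positions at
# once, keep the most confident tokens, re-mask the rest, and repeat.
import torch

def iterative_parallel_decode(fill_probs, length: int, mask_id: int, steps: int = 4):
    seq = torch.full((1, length), mask_id, dtype=torch.long)
    confidence = torch.zeros(1, length)
    for s in range(steps):
        probs = fill_probs(seq)                 # (1, length, vocab)
        confidence, seq = probs.max(dim=-1)     # parallel predictions + scores
        if s + 1 == steps:
            break
        # Re-mask the least confident positions and refine them next round.
        n_remask = int(length * (1 - (s + 1) / steps))
        if n_remask > 0:
            remask = confidence.topk(n_remask, largest=False).indices
            seq.scatter_(1, remask, mask_id)
    return seq
```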
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
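One hedged reading of hidden transfer, sketched below: pseudo future states are projected from an intermediate-layer hidden state and then processed by the remaining layers in the same forward pass (the layer split, the linear projections, and the shapes are assumptions):

```python
# Illustrative hidden-transfer module: project draft states for future
# positions from the last real token's intermediate-layer hidden state.
import torch
import torch.nn as nn

class HiddenTransfer(nn.Module):
    def __init__(self, hidden: int, n_draft: int = 3):
        super().__init__()
        # One learned projection per future position t+1 .. t+n_draft.
        self.transfer = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_draft))

    def forward(self, h_mid: torch.Tensor) -> torch.Tensor:
        # h_mid: (batch, seq, hidden) intermediate-layer states of real tokens.
        last = h_mid[:, -1]                                    # (batch, hidden)
        drafts = torch.stack([proj(last) for proj in self.transfer], dim=1)
        # Concatenate so the upper layers decode real + pseudo tokens together.
        return torch.cat([h_mid, drafts], dim=1)
```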
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
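A hedged skeleton of that layout (layer counts, the use of learned positional queries in the PDS, and the maximum output length are assumptions, not LASO's exact design):

```python
# Illustrative LASO-style layout: acoustic encoder -> position-dependent
# summarizer (learned per-position queries attend over the encoding) ->
# decoder that emits all output tokens in parallel.
import torch
import torch.nn as nn

class LASOSketch(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=5000, max_len=64,
                 n_layers=4, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.queries = nn.Parameter(torch.randn(max_len, hidden) * 0.02)
        self.pds = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, n_layers)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, feat_dim) acoustic features.
        memory = self.encoder(self.proj(speech_feats))
        q = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        summarized, _ = self.pds(q, memory, memory)
        # All output positions are decoded in parallel (non-autoregressive).
        return self.out(self.decoder(summarized))   # (batch, max_len, vocab)
```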
arXiv Detail & Related papers (2021-02-15T15:18:59Z)