On multi-token prediction for efficient LLM inference
- URL: http://arxiv.org/abs/2502.09419v1
- Date: Thu, 13 Feb 2025 15:42:44 GMT
- Title: On multi-token prediction for efficient LLM inference
- Authors: Somesh Mehra, Javier Alonso Garcia, Lukas Mauch,
- Abstract summary: We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities.<n>We then explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP.
- Score: 0.36681882674260474
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We systematically investigate multi-token prediction (MTP) capabilities within LLMs pre-trained for next-token prediction (NTP). We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities, though performance is data-dependent and improves with model scale. Furthermore, we explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP, making adaptation non-trivial. Finally, we show that while joint training of MTP heads with the backbone improves performance, it cannot fully overcome this barrier, prompting further research in this direction. Our findings provide a deeper understanding of MTP applied to pretrained LLMs, informing strategies for accelerating inference through parallel token prediction.
Related papers
- No Global Plan in Chain-of-Thought: Uncover the Latent Planning Horizon of LLMs [65.783709850324]
This work stems from prior complementary observations on the dynamics of Chain-of-Thought (CoT): Large Language Models (LLMs)<n>LLMs are shown latent planning of subsequent reasoning prior to CoT emergence, thereby diminishing the significance of explicit CoT.<n>We investigate the latent planning strength of LLMs, through our probing method, Tele-Lens, applying to hidden states across diverse task domains.
arXiv Detail & Related papers (2026-02-02T13:46:56Z) - Fast and Expressive Multi-Token Prediction with Probabilistic Circuits [29.853857313543468]
Multi-token prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs)<n>We investigate the trade-off between expressiveness and latency in MTP within the framework of probabilistic circuits (PCs)<n>Our framework, named MTPC, allows one to explore different ways to encode the joint distributions over future tokens.
arXiv Detail & Related papers (2025-11-14T14:33:14Z) - FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction [11.691960175716163]
This paper introduces FastMTP, a method that improves multi-step draft quality by aligning MTP training with its inference pattern.<n>Our approach fine-tunes a single MTP head with position-shared weights on self-distilled data, enabling it to capture dependencies among consecutive future tokens.<n> Experimental results across seven diverse benchmarks demonstrate that FastMTP achieves an average of 2.03x speedup compared to standard next token prediction.
arXiv Detail & Related papers (2025-09-16T07:36:26Z) - Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs [57.82819770709032]
Large language models (LLMs) can be effective context-aided forecasters via na"ive direct prompting.<n>ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model's reasoning over the context.<n>CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines.<n> IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models.
arXiv Detail & Related papers (2025-08-13T16:02:55Z) - Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridgingSupervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training.<n>We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z) - Mitigating Spurious Correlations in LLMs via Causality-Aware Post-Training [57.03005244917803]
Large language models (LLMs) often fail on out-of-distribution (OOD) samples due to spurious correlations acquired during pre-training.<n>Here, we aim to mitigate such spurious correlations through causality-aware post-training (CAPT)<n> Experiments on the formal causal inference benchmark CLadder and the logical reasoning dataset PrOntoQA show that 3B-scale language models fine-tuned with CAPT can outperform both traditional SFT and larger LLMs on in-distribution (ID) and OOD tasks.
arXiv Detail & Related papers (2025-06-11T06:30:28Z) - Pre-Training Curriculum for Multi-Token Prediction in Language Models [2.8071268036220003]
Multi-token prediction (MTP) is a recently proposed pre-training objective for language models.<n>We propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum and a reverse curriculum.
arXiv Detail & Related papers (2025-05-28T18:19:18Z) - L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models [69.1271366892683]
We propose leap multi-token prediction(L-MTP), an innovative token prediction method.<n>Unlike conventional MTP, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass.<n>We theoretically demonstrate the benefit of L-MTP in improving inference efficiency.
arXiv Detail & Related papers (2025-05-23T05:59:46Z) - Injecting Imbalance Sensitivity for Multi-Task Learning [36.60453299563175]
Multi-task learning (MTL) has emerged as a promising approach for deploying deep learning models in real-life applications.
Recent studies have proposed optimization-based learning paradigms to establish task-shared representations in MTL.
Our paper empirically argues that these studies primarily emphasize the conflict issue while neglecting the potentially more significant impact of imbalance/dominance in MTL.
arXiv Detail & Related papers (2025-03-11T03:11:54Z) - Reasoning Bias of Next Token Prediction Training [5.188841610098436]
Next token prediction (NTP) is the dominant training paradigm for Large Language Models (LLMs)<n>We show that despite NTP's exposure to noise during training, it surpasses in reasoning ability.<n>We attribute this counterintuitive outcome to the regularizing influence of noise on the training dynamics.
arXiv Detail & Related papers (2025-02-04T04:46:41Z) - Understanding Chain-of-Thought in LLMs through Information Theory [16.78730663293352]
We formalize Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) through an information-theoretic lens.
Specifically, our framework quantifies the information gain' at each reasoning step, enabling the identification of failure modes.
We demonstrate the efficacy of our approach through extensive experiments on toy and GSM-8K data, where it significantly outperforms existing outcome-based methods.
arXiv Detail & Related papers (2024-11-18T19:14:36Z) - Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more-efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific loss and downstream performance.
arXiv Detail & Related papers (2024-10-11T04:57:48Z) - SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe [30.03925858123481]
We propose SFTMix, a novel recipe that elevates instruction-tuning performance beyond the conventional NTP paradigm.
Based on training dynamics, we argue that examples with different confidence levels should play distinct roles during the instruction-tuning process.
This approach enables SFTMix to significantly outperform NTP across a wide range of instruction-following and healthcare domain-specific SFT tasks.
arXiv Detail & Related papers (2024-10-07T17:52:21Z) - NDP: Next Distribution Prediction as a More Broad Target [59.30497395313209]
We introduce Next Distribution Prediction (NDP), which uses $n$-gram distributions to replace the one-hot targets.
NDP can achieve up to +2.97 COMET improvement in translation tasks, +0.61 average improvement in general tasks, and incredible +10.75 average improvement in the medical domain.
arXiv Detail & Related papers (2024-08-30T16:13:49Z) - Multi-Epoch learning with Data Augmentation for Deep Click-Through Rate Prediction [53.88231294380083]
We introduce a novel Multi-Epoch learning with Data Augmentation (MEDA) framework, suitable for both non-continual and continual learning scenarios.
MEDA minimizes overfitting by reducing the dependency of the embedding layer on subsequent training data.
Our findings confirm that pre-trained layers can adapt to new embedding spaces, enhancing performance without overfitting.
arXiv Detail & Related papers (2024-06-27T04:00:15Z) - Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models [58.58594658683919]
Large multimodal models (LMMs) have shown transformative potential across various research tasks.
Our findings indicate LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs.
We propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
arXiv Detail & Related papers (2024-05-24T16:26:56Z) - The Gaps between Pre-train and Downstream Settings in Bias Evaluation
and Debiasing [74.7319697510621]
In-Context Learning (ICL) induces smaller changes to PLMs compared to FT-based debiasing methods.
ICL-based debiasing methods show a higher correlation between intrinsic and extrinsic bias scores compared to FT-based methods.
arXiv Detail & Related papers (2024-01-16T17:15:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.