On multi-token prediction for efficient LLM inference
- URL: http://arxiv.org/abs/2502.09419v1
- Date: Thu, 13 Feb 2025 15:42:44 GMT
- Title: On multi-token prediction for efficient LLM inference
- Authors: Somesh Mehra, Javier Alonso Garcia, Lukas Mauch,
- Abstract summary: We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities.<n>We then explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP.
- Score: 0.36681882674260474
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We systematically investigate multi-token prediction (MTP) capabilities within LLMs pre-trained for next-token prediction (NTP). We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities, though performance is data-dependent and improves with model scale. Furthermore, we explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP, making adaptation non-trivial. Finally, we show that while joint training of MTP heads with the backbone improves performance, it cannot fully overcome this barrier, prompting further research in this direction. Our findings provide a deeper understanding of MTP applied to pretrained LLMs, informing strategies for accelerating inference through parallel token prediction.
Related papers
- Injecting Imbalance Sensitivity for Multi-Task Learning [36.60453299563175]
Multi-task learning (MTL) has emerged as a promising approach for deploying deep learning models in real-life applications.
Recent studies have proposed optimization-based learning paradigms to establish task-shared representations in MTL.
Our paper empirically argues that these studies primarily emphasize the conflict issue while neglecting the potentially more significant impact of imbalance/dominance in MTL.
arXiv Detail & Related papers (2025-03-11T03:11:54Z) - Reasoning Bias of Next Token Prediction Training [5.188841610098436]
Next token prediction (NTP) is the dominant training paradigm for Large Language Models (LLMs)<n>We show that despite NTP's exposure to noise during training, it surpasses in reasoning ability.<n>We attribute this counterintuitive outcome to the regularizing influence of noise on the training dynamics.
arXiv Detail & Related papers (2025-02-04T04:46:41Z) - Understanding Chain-of-Thought in LLMs through Information Theory [16.78730663293352]
We formalize Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) through an information-theoretic lens.
Specifically, our framework quantifies the information gain' at each reasoning step, enabling the identification of failure modes.
We demonstrate the efficacy of our approach through extensive experiments on toy and GSM-8K data, where it significantly outperforms existing outcome-based methods.
arXiv Detail & Related papers (2024-11-18T19:14:36Z) - Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more-efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific loss and downstream performance.
arXiv Detail & Related papers (2024-10-11T04:57:48Z) - SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe [30.03925858123481]
We propose SFTMix, a novel recipe that elevates instruction-tuning performance beyond the conventional NTP paradigm.
Based on training dynamics, we argue that examples with different confidence levels should play distinct roles during the instruction-tuning process.
This approach enables SFTMix to significantly outperform NTP across a wide range of instruction-following and healthcare domain-specific SFT tasks.
arXiv Detail & Related papers (2024-10-07T17:52:21Z) - NDP: Next Distribution Prediction as a More Broad Target [59.30497395313209]
We introduce Next Distribution Prediction (NDP), which uses $n$-gram distributions to replace the one-hot targets.
NDP can achieve up to +2.97 COMET improvement in translation tasks, +0.61 average improvement in general tasks, and incredible +10.75 average improvement in the medical domain.
arXiv Detail & Related papers (2024-08-30T16:13:49Z) - Multi-Epoch learning with Data Augmentation for Deep Click-Through Rate Prediction [53.88231294380083]
We introduce a novel Multi-Epoch learning with Data Augmentation (MEDA) framework, suitable for both non-continual and continual learning scenarios.
MEDA minimizes overfitting by reducing the dependency of the embedding layer on subsequent training data.
Our findings confirm that pre-trained layers can adapt to new embedding spaces, enhancing performance without overfitting.
arXiv Detail & Related papers (2024-06-27T04:00:15Z) - Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models [58.58594658683919]
Large multimodal models (LMMs) have shown transformative potential across various research tasks.
Our findings indicate LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs.
We propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
arXiv Detail & Related papers (2024-05-24T16:26:56Z) - The Gaps between Pre-train and Downstream Settings in Bias Evaluation
and Debiasing [74.7319697510621]
In-Context Learning (ICL) induces smaller changes to PLMs compared to FT-based debiasing methods.
ICL-based debiasing methods show a higher correlation between intrinsic and extrinsic bias scores compared to FT-based methods.
arXiv Detail & Related papers (2024-01-16T17:15:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.