FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction
- URL: http://arxiv.org/abs/2509.18362v1
- Date: Tue, 16 Sep 2025 07:36:26 GMT
- Title: FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction
- Authors: Yuxuan Cai, Xiaozhuan Liang, Xinghua Wang, Jin Ma, Haijin Liang, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Yuyang Yin, Xi Chen
- Abstract summary: This paper introduces FastMTP, a method that improves multi-step draft quality by aligning MTP training with its inference pattern. Our approach fine-tunes a single MTP head with position-shared weights on self-distilled data, enabling it to capture dependencies among consecutive future tokens. Experimental results across seven diverse benchmarks demonstrate that FastMTP achieves an average of 2.03x speedup compared to standard next token prediction.
- Score: 11.691960175716163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) become increasingly powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits practical deployment. While Multi-Token Prediction (MTP) has demonstrated remarkable benefits for model training efficiency and performance, its inherent potential for inference acceleration remains largely unexplored. This paper introduces FastMTP, a simple yet effective method that improves multi-step draft quality by aligning MTP training with its inference pattern, significantly enhancing speculative decoding performance. Our approach fine-tunes a single MTP head with position-shared weights on self-distilled data, enabling it to capture dependencies among consecutive future tokens and maintain high acceptance rates across multiple recursive draft steps. By integrating language-aware dynamic vocabulary compression into the MTP head, we further reduce computational overhead in the drafting process. Experimental results across seven diverse benchmarks demonstrate that FastMTP achieves an average of 2.03x speedup compared to standard next token prediction with lossless output quality, outperforming vanilla MTP by 82%. FastMTP requires only lightweight training and seamlessly integrates with existing inference frameworks, offering a practical and rapidly deployable solution for accelerating LLM inference.
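To make the inference pattern concrete, here is a minimal, self-contained sketch (not the authors' implementation) of the loop FastMTP targets: a single draft head is applied recursively to propose several future tokens, and the base model then verifies the draft, accepting the longest matching prefix so the output is identical to plain greedy decoding. The `W_base`/`W_mtp` matrices are toy stand-ins for the target LLM and the position-shared MTP head; the language-aware vocabulary compression and the batched single-pass verification used in practice are omitted for clarity.

```python
import numpy as np

VOCAB = 64
rng = np.random.default_rng(0)
W_base = rng.normal(size=(VOCAB, VOCAB))  # stand-in for the target model's next-token logits
W_mtp = rng.normal(size=(VOCAB, VOCAB))   # stand-in for the single MTP draft head

def base_logits(token: int) -> np.ndarray:
    # Toy Markov stand-in: real models condition on the whole prefix.
    return W_base[token]

def mtp_logits(token: int) -> np.ndarray:
    # The same head (same weights) is reused at every draft position,
    # mirroring the recursive inference pattern the paper trains for.
    return W_mtp[token]

def draft(last_token: int, k: int) -> list[int]:
    """Recursively roll the single MTP head forward to propose k draft tokens."""
    out, tok = [], last_token
    for _ in range(k):
        tok = int(np.argmax(mtp_logits(tok)))
        out.append(tok)
    return out

def verify(last_token: int, drafted: list[int]) -> list[int]:
    """Greedy verification: accept the longest draft prefix the base model agrees
    with, then emit the base model's own token, so the output matches plain
    greedy decoding (lossless) while advancing several tokens per round."""
    accepted, prev = [], last_token
    for d in drafted:
        target = int(np.argmax(base_logits(prev)))
        accepted.append(target)
        if target != d:
            return accepted
        prev = d
    accepted.append(int(np.argmax(base_logits(prev))))
    return accepted

def generate(prompt_token: int, n_tokens: int, k: int = 4) -> list[int]:
    seq = [prompt_token]
    while len(seq) <= n_tokens:
        seq.extend(verify(seq[-1], draft(seq[-1], k)))
    return seq[1:n_tokens + 1]

print(generate(prompt_token=3, n_tokens=16))
```

The speedup reported in the paper comes from keeping the acceptance rate high over several recursive draft steps: the more of each k-token draft the verifier accepts, the fewer full forward passes of the base model are needed per generated token.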
Related papers
- Fast and Expressive Multi-Token Prediction with Probabilistic Circuits [29.853857313543468]
Multi-token prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs). We investigate the trade-off between expressiveness and latency in MTP within the framework of probabilistic circuits (PCs). Our framework, named MTPC, allows one to explore different ways to encode the joint distributions over future tokens (see the sketch after this entry).
arXiv Detail & Related papers (2025-11-14T14:33:14Z)
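The expressiveness/latency trade-off MTPC studies comes down to how the joint distribution over future tokens is encoded. The toy example below (a sketch of the underlying idea, not MTPC's actual probabilistic-circuit construction) contrasts a fully factorized draft distribution, which cannot represent correlations between future positions, with a two-component mixture over a shared latent variable, the simplest circuit-like structure that recovers the correlation exactly.

```python
import numpy as np

# Target joint over two future tokens from a 2-symbol vocabulary:
# the tokens are perfectly correlated, p(0,0) = p(1,1) = 0.5.
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])

# (a) Fully factorized encoding p(y1) * p(y2): both marginals are uniform,
# so the best factorized fit spreads mass over all four outcomes.
p1, p2 = joint.sum(axis=1), joint.sum(axis=0)
factorized = np.outer(p1, p2)

# (b) Mixture over two latent components, each component factorized:
# sum_z pi(z) * p(y1|z) * p(y2|z) recovers the correlation exactly.
pi = np.array([0.5, 0.5])
py1_z = np.array([[1.0, 0.0], [0.0, 1.0]])  # p(y1 | z)
py2_z = np.array([[1.0, 0.0], [0.0, 1.0]])  # p(y2 | z)
mixture = sum(pi[z] * np.outer(py1_z[z], py2_z[z]) for z in range(2))

print("factorized:\n", factorized)  # 0.25 everywhere -> correlation lost
print("mixture:\n", mixture)        # matches the target joint
```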
- MC#: Mixture Compressor for Mixture-of-Experts Large Models [86.64315380917827]
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. We propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning (see the sketch after this entry).
arXiv Detail & Related papers (2025-10-13T03:12:46Z)
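As a rough illustration of the two ingredients named above, the sketch below shows a generic MoE layer whose expert weights are statically quantized to int8 and whose low-weight experts are skipped at inference time. The quantization scheme, gating, and threshold here are illustrative assumptions, not MC#'s actual criteria.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_EXPERTS, TOP_K = 16, 8, 4

# Static step: quantize each expert's weights to int8 with a per-expert scale.
experts_fp = [rng.normal(size=(D, D)) for _ in range(N_EXPERTS)]
scales = [np.abs(w).max() / 127.0 for w in experts_fp]
experts_q = [np.round(w / s).astype(np.int8) for w, s in zip(experts_fp, scales)]

gate_w = rng.normal(size=(D, N_EXPERTS))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x: np.ndarray, keep_thresh: float = 0.1) -> np.ndarray:
    """Route to top-k experts, then dynamically prune those whose renormalized
    gate weight falls below keep_thresh, skipping their computation entirely."""
    gates = softmax(x @ gate_w)
    top = np.argsort(gates)[-TOP_K:]
    weights = gates[top] / gates[top].sum()  # renormalize over selected experts
    out = np.zeros_like(x)
    for idx, w in zip(top, weights):
        if w < keep_thresh:                  # dynamic expert pruning
            continue
        dequant = experts_q[idx].astype(np.float32) * scales[idx]
        out += w * (x @ dequant)             # compute only with kept, quantized experts
    return out

print(moe_forward(rng.normal(size=D)).shape)
```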
- Pre-Training Curriculum for Multi-Token Prediction in Language Models [2.8071268036220003]
Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. We propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum and a reverse curriculum.
arXiv Detail & Related papers (2025-05-28T18:19:18Z)
- Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
- L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models [95.53699156138435]
We propose leap multi-token prediction (L-MTP), an innovative token prediction method. Unlike conventional MTP, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass. We theoretically demonstrate the benefit of L-MTP in improving inference efficiency.
arXiv Detail & Related papers (2025-05-23T05:59:46Z)
- Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference [54.53508601749513]
We propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework. To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features. Results show that TOFC achieves up to 52% reduction in data transmission overhead and 63% reduction in system latency.
arXiv Detail & Related papers (2025-03-17T08:37:22Z)
- On multi-token prediction for efficient LLM inference [0.36681882674260474]
We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities (see the worked example after this entry). We then explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP.
arXiv Detail & Related papers (2025-02-13T15:42:44Z)
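The claim that a next-token-prediction model already contains multi-token predictions "via numerical marginalization" can be seen with a tiny worked example (toy probability tables, not the paper's code): the two-steps-ahead distribution follows by summing the chained next-token conditionals over the unknown intermediate token, i.e. p(x_{t+2} | ctx) = sum over x_{t+1} of p(x_{t+1} | ctx) * p(x_{t+2} | ctx, x_{t+1}).

```python
import numpy as np

VOCAB = 5
rng = np.random.default_rng(2)

def normalize(m):
    return m / m.sum(axis=-1, keepdims=True)

# Toy stand-ins for an NTP model's conditionals (each row sums to 1).
p_next = normalize(rng.random(VOCAB))                # p(x_{t+1} | ctx)
p_next_next = normalize(rng.random((VOCAB, VOCAB)))  # p(x_{t+2} | ctx, x_{t+1})

# Marginalize out the intermediate token to obtain a two-steps-ahead
# prediction from purely next-token conditionals.
p_two_ahead = p_next @ p_next_next

print(p_two_ahead, p_two_ahead.sum())  # a valid distribution (sums to 1)
```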
- Meaning Typed Prompting: A Technique for Efficient, Reliable Structured Output Generation [0.0]
We introduce Meaning Typed Prompting (MTP), a technique for efficient structured output generation.
By utilizing expressive type definitions, MTP enhances output clarity and reduces dependence on complex abstractions.
We present Semantix, a framework that implements MTP, providing practical insights into its application.
arXiv Detail & Related papers (2024-10-22T20:43:50Z)
- MTP: A Meaning-Typed Language Abstraction for AI-Integrated Programming [8.768061489034642]
This paper presents Meaning-Typed Programming (MTP), a novel paradigm that automates integration through intuitive language-level constructs. We implement MTP in Jac, a programming language that supersets Python, and find that MTP significantly reduces coding complexity while maintaining accuracy and efficiency. Our user study shows that developers using MTP completed tasks 3.2x faster with 45% fewer lines of code compared to existing frameworks.
arXiv Detail & Related papers (2024-05-14T21:12:01Z)
- MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer [66.71930982549028]
Vision-Language Transformers (VLTs) have shown great success recently, but are accompanied by heavy computation costs.
We propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs.
arXiv Detail & Related papers (2024-03-05T14:13:50Z)
- Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications [63.29358103217275]
Compressing Large Language Models (LLMs) often leads to reduced performance, especially for knowledge-intensive tasks.
We propose two conjectures on the nature of the damage: one is that certain knowledge is forgotten (or erased) after compression.
We introduce a variant called Inference-time Dynamic Prompting (IDP) that can effectively increase prompt diversity without incurring any inference overhead.
arXiv Detail & Related papers (2023-10-02T03:12:06Z)
- StreamYOLO: Real-time Object Detection for Streaming Perception [84.2559631820007]
We endow the models with the capacity of predicting the future, significantly improving the results for streaming perception.
We consider driving scenes with multiple velocities and propose velocity-aware streaming AP (VsAP) to jointly evaluate accuracy.
Our simple method achieves state-of-the-art performance on the Argoverse-HD dataset and improves sAP and VsAP by 4.7% and 8.2%, respectively.
arXiv Detail & Related papers (2022-07-21T12:03:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.