Fast and Expressive Multi-Token Prediction with Probabilistic Circuits
- URL: http://arxiv.org/abs/2511.11346v1
- Date: Fri, 14 Nov 2025 14:33:14 GMT
- Title: Fast and Expressive Multi-Token Prediction with Probabilistic Circuits
- Authors: Andreas Grivas, Lorenzo Loconte, Emile van Krieken, Piotr Nawrot, Yu Zhao, Euan Wielewski, Pasquale Minervini, Edoardo Ponti, Antonio Vergari
- Abstract summary: Multi-token prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs). We investigate the trade-off between expressiveness and latency in MTP within the framework of probabilistic circuits (PCs). Our framework, named MTPC, allows one to explore different ways to encode the joint distributions over future tokens.
- Score: 29.853857313543468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-token prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs), including byte-level LLMs, which are tokeniser-free but prohibitively slow. However, existing MTP methods often sacrifice expressiveness by assuming independence between future tokens. In this work, we investigate the trade-off between expressiveness and latency in MTP within the framework of probabilistic circuits (PCs). Our framework, named MTPC, allows one to explore different ways to encode the joint distributions over future tokens by selecting different circuit architectures, generalising classical models such as (hierarchical) mixture models, hidden Markov models and tensor networks. We show the efficacy of MTPC by retrofitting existing byte-level LLMs, such as EvaByte. Our experiments show that, when combined with speculative decoding, MTPC significantly speeds up generation compared to MTP with independence assumptions, while guaranteeing to retain the performance of the original verifier LLM. We also rigorously study the optimal trade-off between expressiveness and latency when exploring the possible parameterisations of MTPC, such as PC architectures and partial layer sharing between the verifier and draft LLMs.
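The expressiveness gap the abstract refers to can be seen on a toy case. The sketch below is not the MTPC implementation; it is a minimal illustration, assuming a two-token vocabulary and a window of two future tokens, of how a fully factorised MTP head differs from a mixture of factorised components (one of the classical models the abstract says MTPC generalises). The function names `factorised_joint` and `mixture_joint` are hypothetical.

```python
from itertools import product
from math import prod


def factorised_joint(marginals):
    """Joint under the independence assumption: p(x_1..x_n) = prod_i p_i(x_i)."""
    vocab = range(len(marginals[0]))
    return {
        xs: prod(m[x] for m, x in zip(marginals, xs))
        for xs in product(vocab, repeat=len(marginals))
    }


def mixture_joint(weights, components):
    """Mixture of factorised components: sum_k w_k * prod_i p_{k,i}(x_i)."""
    n_positions = len(components[0])
    vocab = range(len(components[0][0]))
    return {
        xs: sum(
            w * prod(comp[i][x] for i, x in enumerate(xs))
            for w, comp in zip(weights, components)
        )
        for xs in product(vocab, repeat=n_positions)
    }


# Toy target: the two future tokens are always equal ("aa" or "bb", each 50%).
# Token ids: 0 = "a", 1 = "b".

# Independent heads can only match the per-position marginals (0.5 / 0.5),
# so they spread mass over the impossible sequences "ab" and "ba".
indep = factorised_joint([[0.5, 0.5], [0.5, 0.5]])

# Two mixture components, one per mode, recover the dependency.
mix = mixture_joint(
    weights=[0.5, 0.5],
    components=[
        [[1.0, 0.0], [1.0, 0.0]],  # component 0: always "aa"
        [[0.0, 1.0], [0.0, 1.0]],  # component 1: always "bb"
    ],
)

for xs in sorted(indep):
    print(xs, "independent:", indep[xs], "mixture:", mix[xs])
```

In a speculative-decoding setting, a draft that wastes probability on sequences the verifier would never produce is likely to see more rejections, which is presumably why a more expressive joint can pay off despite the extra cost per draft step.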
Related papers
- Temporal Guidance for Large Language Models [22.7968403903992]
We propose a novel contrastive guidance strategy along the temporal dimension, namely Temporal Conditional Guidance (TeGu). Our method ingeniously leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.
arXiv Detail & Related papers (2026-01-29T14:01:00Z)
- FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction [11.691960175716163]
This paper introduces FastMTP, a method that improves multi-step draft quality by aligning MTP training with its inference pattern. Our approach fine-tunes a single MTP head with position-shared weights on self-distilled data, enabling it to capture dependencies among consecutive future tokens. Experimental results across seven diverse benchmarks demonstrate that FastMTP achieves an average of 2.03x speedup compared to standard next token prediction.
arXiv Detail & Related papers (2025-09-16T07:36:26Z)
- Accelerating Diffusion LLMs via Adaptive Parallel Decoding [60.407727995313074]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z)
- Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z)
- L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models [95.53699156138435]
We propose leap multi-token prediction (L-MTP), an innovative token prediction method. Unlike conventional MTP, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass. We theoretically demonstrate the benefit of L-MTP in improving inference efficiency.
arXiv Detail & Related papers (2025-05-23T05:59:46Z)
- On multi-token prediction for efficient LLM inference [0.36681882674260474]
We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities. We then explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP.
arXiv Detail & Related papers (2025-02-13T15:42:44Z)
- R-MTLLMF: Resilient Multi-Task Large Language Model Fusion at the Wireless Edge [78.26352952957909]
Multi-task large language models (MTLLMs) are important for many applications at the wireless edge, where users demand specialized models to handle multiple tasks efficiently. The concept of model fusion via task vectors has emerged as an efficient approach for combining fine-tuning parameters to produce an MTLLM. In this paper, the problem of enabling edge users to collaboratively craft such MTLLMs via task vectors is studied, under the assumption of worst-case adversarial attacks.
arXiv Detail & Related papers (2024-11-27T10:57:06Z)
- SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration [10.970637831760136]
Speculative decoding (SD) has emerged as a widely used paradigm to accelerate LLM inference without compromising quality. We introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference. Our experiments demonstrate that SWIFT can achieve a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
arXiv Detail & Related papers (2024-10-09T14:15:30Z)
- MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer [66.71930982549028]
Vision-Language Transformers (VLTs) have shown great success recently, but are accompanied by heavy computation costs.
We propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs.
arXiv Detail & Related papers (2024-03-05T14:13:50Z)
- SPEED: Speculative Pipelined Execution for Efficient Decoding [35.45955948053644]
We propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token.
For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized.
We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy and demonstrate how speculation allows for training deeper decoders with parameter sharing with minimal runtime overhead.
arXiv Detail & Related papers (2023-10-18T16:07:01Z)
- Towards Semantic Communication Protocols: A Probabilistic Logic Perspective [69.68769942563812]
We propose a semantic protocol model (SPM) constructed by transforming an NPM into an interpretable symbolic graph written in the probabilistic logic programming language (ProbLog).
By leveraging its interpretability and memory-efficiency, we demonstrate several applications such as SPM reconfiguration for collision-avoidance.
arXiv Detail & Related papers (2022-07-08T14:19:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.