The Disparate Impacts of Speculative Decoding
- URL: http://arxiv.org/abs/2510.02128v1
- Date: Thu, 02 Oct 2025 15:38:57 GMT
- Title: The Disparate Impacts of Speculative Decoding
- Authors: Jameson Sandler, Ahmet Üstün, Marco Romanelli, Sara Hooker, Ferdinando Fioretto
- Abstract summary: Speculative decoding is a technique for systematically reducing the decoding time of large language models. The paper shows that the speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing for under-fit, and often underrepresented, tasks.
- Score: 54.98795989404752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The practice of speculative decoding, whereby inference is probabilistically supported by a smaller, cheaper, "drafter" model, has become a standard technique for systematically reducing the decoding time of large language models. This paper conducts an analysis of speculative decoding through the lens of its potential disparate speed-up rates across tasks. Crucially, the paper shows that speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing for under-fit, and often underrepresented, tasks. To better understand this phenomenon, we derive an analysis to quantify this observed "unfairness" and draw attention to the factors that motivate such disparate speed-ups to emerge. Further, guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, revealing on average a 12% improvement in our fairness metric.
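To make the abstract's claim concrete, the sketch below uses the standard speculative sampling analysis, not the paper's own fairness metric or mitigation strategy. It assumes a constant per-token acceptance rate equal to the overlap between the target and drafter next-token distributions, and a fixed number of drafted tokens per step. The three-token distributions and the names `acceptance_rate` and `expected_tokens_per_step` are hypothetical, chosen only for illustration.

```python
def acceptance_rate(p, q):
    """Probability a drafted token is accepted: sum over x of min(p(x), q(x)).

    p is the target model's next-token distribution, q is the drafter's.
    """
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target-model pass with gamma drafted tokens.

    Assumes each drafted token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


# Hypothetical next-token distributions over a three-token vocabulary.
target = [0.6, 0.3, 0.1]
well_fit_draft = [0.5, 0.4, 0.1]   # drafter close to the target on this task
under_fit_draft = [0.1, 0.2, 0.7]  # drafter far from the target on this task

for name, draft in [("well-fit", well_fit_draft), ("under-fit", under_fit_draft)]:
    alpha = acceptance_rate(target, draft)
    tokens = expected_tokens_per_step(alpha, gamma=4)
    print(f"{name}: acceptance={alpha:.2f}, tokens per target pass={tokens:.2f}")
```

Under these assumptions, a task on which the drafter tracks the target poorly yields fewer accepted tokens per target pass, which is one way uneven speed-ups across tasks can arise.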
Related papers
- Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization [56.59356959631999]
Gated Perception-Reasoning Optimization (GPRO) is a meta-reasoning controller that dynamically routes computation among three decision paths. GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods.
arXiv Detail & Related papers (2026-01-07T23:05:17Z)
- Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models [29.56923793047279]
We introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. DOT targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities. Our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy.
arXiv Detail & Related papers (2026-01-07T14:31:07Z)
- From Fake Focus to Real Precision: Confusion-Driven Adversarial Attention Learning in Transformers [0.0]
Transformer-based models have been widely adopted for sentiment analysis tasks due to their exceptional ability to capture contextual information. We observe that existing models tend to allocate attention primarily to common words, overlooking less popular yet highly task-relevant terms. We propose an Adversarial Feedback for Attention (AFA) training mechanism that enables the model to automatically redistribute attention weights to appropriate focal points.
arXiv Detail & Related papers (2025-12-19T01:48:25Z)
- Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components. The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z)
- A Theoretical Perspective for Speculative Decoding Algorithm [60.79447486066416]
One effective way to accelerate inference is Speculative Decoding, which employs a small model to sample a sequence of draft tokens and a large model to validate.
This paper tackles this gap by conceptualizing the decoding problem via Markov chain abstraction and studying the key properties, output quality and inference acceleration, from a theoretical perspective.
arXiv Detail & Related papers (2024-10-30T01:53:04Z)
- Investigating the Impact of Quantization on Adversarial Robustness [22.637585106574722]
Quantization is a technique for reducing the bit-width of deep models to improve their runtime performance and storage efficiency.
In real-world scenarios, quantized models are often faced with adversarial attacks which cause the model to make incorrect inferences.
We conduct a first-time analysis of the impact of the quantization pipeline components that can incorporate robust optimization.
arXiv Detail & Related papers (2024-04-08T16:20:15Z)
- QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning [52.157939524815866]
In this paper, we identify imbalanced activation distributions as a primary source of quantization difficulty. We propose to adjust these distributions through weight finetuning to be more quantization-friendly. Our method demonstrates its efficacy across three high-resolution image generation tasks.
arXiv Detail & Related papers (2024-02-06T03:39:44Z)
- Hybrid Predictive Coding: Inferring, Fast and Slow [62.997667081978825]
We propose a hybrid predictive coding network that combines both iterative and amortized inference in a principled manner.
We demonstrate that our model is inherently sensitive to its uncertainty and adaptively balances iterative and amortized inference to obtain accurate beliefs using minimum computational expense.
arXiv Detail & Related papers (2022-04-05T12:52:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.