Related papers: READER: Retrieval-Assisted Drafter for Efficient LLM Inference

URL: http://arxiv.org/abs/2508.09072v2
Date: Sat, 27 Sep 2025 20:13:25 GMT
Title: READER: Retrieval-Assisted Drafter for Efficient LLM Inference
Authors: Maxim Divilkovskiy, Vitaly Malygin, Sergey Zlobin, Stanislav Ilyushin, Sultan Isali, Vasily Kalugin, Nuriza Aitassova, Fei Yi, Weidi Zeng,
Abstract summary: Autoregressive Language Models instantiate a factorized likelihood over token sequences, yet their strictly sequential decoding process imposes an intrinsic lower bound on inference latency. This bottleneck has emerged as a central obstacle to the scalable deployment of large-scale generative models. We present READER, a speculative decoding framework that bypasses the training of the auxiliary draft model.
Score: 0.0386965802948046
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive Language Models instantiate a factorized likelihood over token sequences, yet their strictly sequential decoding process imposes an intrinsic lower bound on inference latency. This bottleneck has emerged as a central obstacle to the scalable deployment of large-scale generative models. Existing acceleration techniques partially mitigate token-level latency by relying on auxiliary draft models or introducing an additional training phase, but fail to address the dominant memory and communication costs. We present READER, a provably lossless speculative decoding framework that bypasses the training of the auxiliary draft model. READER formalizes speculative decoding as a stochastic tree construction problem and exploits the empirical redundancy structure of natural language to generate high-probability candidate continuations. Our method revisits the problem of constructing draft trees, establishing substantial statistical improvements over stochastic draft-tree methods and providing a complexity-theoretic analysis that characterizes the optimality frontier of speculative decoding under bounded computation and memory resources. Beyond the single-sequence regime traditionally considered in prior work, we introduce a memory-optimal key-value cache-serving strategy that guarantees amortized sublinear overhead in the batch dimension, allowing READER to scale to realistic inference workloads. Comprehensive experiments demonstrate up to 6.13x wall-clock speedup on single-prompt inference and up to 5.92x on batched inference, consistently surpassing prior speculative decoding baselines, while preserving exact output equivalence, with even more pronounced gains in retrieval-augmented generation pipelines. Our results close a key gap between theoretical parallelism limits and practical LLM inference, suggesting a new standard for efficient deployment.
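Illustrative sketch (not from the paper): the abstract describes retrieval-assisted drafting with lossless verification, and the toy Python below shows that general idea in its simplest form. It is not READER's algorithm. READER builds draft trees and serves a batched key-value cache, whereas this sketch drafts a single linear continuation by n-gram lookup in a reference corpus and verifies it under greedy decoding. The names retrieve_draft, speculative_greedy_decode, greedy_decode and toy_target are hypothetical, and the per-token calls to target_next stand in for what a real system would do in one batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]


def retrieve_draft(context: Sequence[Token], corpus: Sequence[Token],
                   ngram: int = 2, max_draft: int = 4) -> List[Token]:
    """Return the tokens that follow the latest corpus match of the context suffix."""
    if len(context) < ngram:
        return []
    key = list(context[-ngram:])
    # Scan from the end so the most recent match wins.
    for start in range(len(corpus) - ngram, -1, -1):
        if list(corpus[start:start + ngram]) == key:
            return list(corpus[start + ngram:start + ngram + max_draft])
    return []


def speculative_greedy_decode(prompt: Sequence[Token], target_next: NextTokenFn,
                              corpus: Sequence[Token],
                              max_new_tokens: int) -> Tuple[List[Token], int]:
    """Greedy decoding with retrieved drafts; returns (tokens, verification rounds)."""
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    rounds = 0
    while len(out) < limit:
        draft = retrieve_draft(out, corpus)
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok != expected:
                # First mismatch: keep the target model's token and stop.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Whole draft accepted (or draft empty): add one extra target token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
        rounds += 1
    return out[:limit], rounds


def greedy_decode(prompt: Sequence[Token], target_next: NextTokenFn,
                  max_new_tokens: int) -> List[Token]:
    """Plain one-token-at-a-time greedy decoding, used as the reference."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def toy_target(context: Sequence[Token]) -> Token:
    """Hypothetical stand-in for a target model: counts upward modulo 5."""
    return (context[-1] + 1) % 5


if __name__ == "__main__":
    corpus = [0, 1, 2, 3, 4, 0, 1, 2, 9, 9]
    tokens, rounds = speculative_greedy_decode([0, 1], toy_target, corpus, 10)
    print(tokens == greedy_decode([0, 1], toy_target, 10), rounds)
```

Because every accepted token is checked against the target model's own greedy choice, the output matches plain greedy decoding by construction; the only thing that varies with draft quality is the number of verification rounds.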

Related papers

Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes [10.877713536966601]
The Longest Stable Prefix (LSP) scheduler is a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. LSP evaluates token stability via a single forward pass and dynamically identifies a contiguous left-aligned block of stable predictions. It snaps its boundary to natural linguistic or structural acceptances before an atomic commitment.
arXiv Detail & Related papers (2026-03-05T18:25:26Z)
Latent-Space Contrastive Reinforcement Learning for Stable and Efficient LLM Reasoning [16.244366307890832]
We propose DeepLatent Reasoning (DLR), a latent-space bidirectional contrastive reinforcement learning framework. This framework shifts the trial-and-error cost from expensive token-level full sequence generation to the continuous latent manifold. Experiments demonstrate that DLR achieves more stable training convergence, supports longer-horizon reasoning chains, and facilitates the sustainable accumulation of reasoning capabilities.
arXiv Detail & Related papers (2026-01-24T03:18:22Z)
Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding. We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
arXiv Detail & Related papers (2025-12-26T07:53:41Z)
Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective [85.06838178922791]
Reinforcement Learning (RL) has proven highly effective for autoregressive language models. But adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. We propose a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy.
arXiv Detail & Related papers (2025-12-03T13:05:32Z)
Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios [76.85739138203014]
We present SpecFormer, a novel architecture that accelerates unidirectional and attention mechanisms. We demonstrate that SpecFormer achieves lower training demands and reduced computational costs.
arXiv Detail & Related papers (2025-11-25T14:20:08Z)
Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components. The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z)
Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z)
Hierarchical Verification of Speculative Beams for Accelerating LLM Inference [0.0]
Hierarchical Verification Tree (HVT) is a novel framework that restructures speculative beam decoding by prioritizing high-likelihood drafts. HVT consistently outperforms existing speculative decoding schemes, achieving substantial reductions in inference time and energy consumption. Findings highlight the potential of hierarchical verification strategies as a new direction for accelerating large language model inference.
arXiv Detail & Related papers (2025-07-30T02:58:03Z)
Inference Acceleration of Autoregressive Normalizing Flows by Selective Jacobi Decoding [12.338918067455436]
Normalizing flows are promising generative models with advantages such as theoretical rigor, analytical log-likelihood, and end-to-end training. Recent advancements utilize autoregressive modeling, significantly enhancing expressive power and generation quality. We propose a selective Jacobi decoding (SeJD) strategy that accelerates autoregressive inference through parallel iterative optimization.
arXiv Detail & Related papers (2025-05-30T16:53:15Z)
Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling. We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z)
Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
Generative retrieval reformulates retrieval as an autoregressive generation task, where large language models generate target documents directly from a query. We systematically investigate training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence performance.
arXiv Detail & Related papers (2025-03-24T17:59:03Z)
TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding [76.23719557942917]
TETRIS actively selects the most promising draft tokens (for every request in a batch) to be accepted when verified in parallel. We show theoretically and empirically that TETRIS outperforms baseline speculative decoding and existing methods that dynamically select draft tokens.
arXiv Detail & Related papers (2025-02-21T04:19:24Z)
Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding [1.3479499607624648]
Speculative decoding addresses the inference bottleneck by introducing a two-stage framework: drafting and verification. A smaller, efficient model generates a preliminary draft, which is then refined by a larger, more sophisticated model. This paper provides a comprehensive survey of speculative decoding methods, categorizing them into draft-centric and model-centric approaches.
arXiv Detail & Related papers (2024-11-20T09:46:30Z)
QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning [58.767866109043055]
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tunes a small pretrained language model to generate optimal prompts tailored to the input queries. We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks. Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
arXiv Detail & Related papers (2024-08-20T03:06:48Z)
Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models. We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses. We observe a remarkable speedup of 1.73$\times$ to 1.96$\times$, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context. The typical autoregressive decoding method requires a separate forward pass through the model for each token generated. We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.