Related papers: SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding

Related papers

Residual Context Diffusion Language Models [90.07635240595926]
Residual Context Diffusion (RCD) is a module that converts discarded token representations into contextual residuals and injects them back for the next denoising step.<n>RCD consistently improves frontier dLLMs by 5-10 points in accuracy with minimal extra computation overhead.
arXiv Detail & Related papers (2026-01-30T13:16:32Z)
Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel.<n>Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding.<n>We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
arXiv Detail & Related papers (2025-12-26T07:53:41Z)
dParallel: Learnable Parallel Decoding for dLLMs [77.24184219948337]
Diffusion large language models (dLLMs) offer parallel token prediction and lower inference latency.<n>Existing open-source models still require nearly token-length decoding steps to ensure performance.<n>We introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling.
arXiv Detail & Related papers (2025-09-30T16:32:52Z)
R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [60.37610817226533]
Chain-of-thought (CoT) reasoning encourages step-by-step intermediate reasoning during inference.<n>CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences.<n>We present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference.
arXiv Detail & Related papers (2025-07-23T08:14:36Z)
Accelerating Diffusion LLMs via Adaptive Parallel Decoding [50.9948753314669]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel.<n>APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z)
Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding [55.2480439325792]
In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution. We show that AS-ARMs achieve state-of-the-art performance among sub-200M parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation.
arXiv Detail & Related papers (2025-04-29T06:33:13Z)
Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources. This paper formulates LLM inference optimization as a multi-stage online scheduling problem. We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
arXiv Detail & Related papers (2025-04-15T16:00:21Z)
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
evaluating constraint on every token can be prohibitively expensive. LCD can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
ConSol: Sequential Probability Ratio Testing to Find Consistent LLM Reasoning Paths Efficiently [3.6393221632527686]
Small language models (LLMs) solve complex tasks by generating intermediate reasoning steps prior to providing answers. The widely-used self-consistency method further exacerbates these costs by aggregating multiple reasoning paths to improve accuracy. We propose leveraging Sequential Probability Ratio Testing (SPRT) to dynamically terminate sampling once sufficient consistency is achieved.
arXiv Detail & Related papers (2025-03-22T00:07:28Z)
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling [59.8051705468084]
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models. We present FR-Spec, a frequency-ranked speculative sampling framework that optimize draft candidate selection through vocabulary space compression.
arXiv Detail & Related papers (2025-02-20T18:58:10Z)
Speeding up Speculative Decoding via Approximate Verification [7.754712828900729]
Speculative Decoding (SD) is a recently proposed technique for faster inference using Large Language Models (LLMs) We propose SPRINTER, which utilizes a low-complexity verifier trained to predict if tokens generated from a draft LLM would be accepted by the target LLM. We present a theoretical analysis of SPRINTER, examining the statistical properties of the generated tokens, as well as the expected reduction in latency.
arXiv Detail & Related papers (2025-02-06T23:10:53Z)
SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications [9.143856130336783]
Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference.<n>Agentic frameworks submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs.<n>We introduce emphSuffixDecoding, a novel method that utilizes efficient suffix trees to cache long token sequences.
arXiv Detail & Related papers (2024-11-07T18:49:33Z)
Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines [17.539008562641303]
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers. Next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data. Fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands.
arXiv Detail & Related papers (2024-09-23T20:14:09Z)
Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context. The typical autoregressive decoding method requires a separate forward pass through the model for each token generated. We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z)
Sparsity-Constraint Optimization via Splicing Iteration [1.3622424109977902]
We develop an algorithm named Sparsity-Constraint Optimization via sPlicing itEration (SCOPE) SCOPE converges effectively without tuning parameters. We apply SCOPE to solve quadratic optimization, learn sparse classifiers, and recover sparse Markov networks for binary variables. Our open-source Python package skscope based on C++ implementation is publicly available on GitHub.
arXiv Detail & Related papers (2024-06-17T18:34:51Z)
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution [87.3259169631789]
Nearest Speculative Decoding (NEST) is capable of incorporating real-world text spans of arbitrary length into the LM generations and providing attribution to their sources. NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks. In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B.
arXiv Detail & Related papers (2024-05-29T17:55:03Z)
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. We propose a novel parallel prompt decoding that requires only $0.0002$% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Our approach demonstrates up to 2.49$times$ speedup and maintains a minimal memory overhead of just $0.0004$%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely textithidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors. We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval. Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets. Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
arXiv Detail & Related papers (2024-02-27T14:21:56Z)
Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement [11.91629418177851]
Speculative decoding is an inference-accel method for large language models. Recent works have advanced this method by establishing a draft-token tree. We present Recursive Speculative Decoding (RSD), a novel tree-based method that samples draft tokens without replacement.
arXiv Detail & Related papers (2024-02-21T22:57:49Z)
Linear-time Minimum Bayes Risk Decoding with Reference Aggregation [52.1701152610258]
Minimum Bayes Risk (MBR) decoding is a text generation technique that has been shown to improve the quality of machine translations. It requires the pairwise calculation of a utility metric, which has quadratic complexity. We propose to approximate pairwise metric scores with scores calculated against aggregated reference representations.
arXiv Detail & Related papers (2024-02-06T18:59:30Z)
Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose A graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges. Considering the redundancy in existing architectures, we first utilize the mode approximation to generate 0.1M trainable parameters to implement the multimodal prompt tuning. A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach.
arXiv Detail & Related papers (2023-05-15T06:40:56Z)
Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer. It accurately predicts the number of output tokens and extract hidden variables. It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.