Related papers: RASD: Retrieval-Augmented Speculative Decoding

URL: http://arxiv.org/abs/2503.03434v1
Date: Wed, 05 Mar 2025 12:10:14 GMT
Title: RASD: Retrieval-Augmented Speculative Decoding
Authors: Guofeng Quan, Wenfeng Feng, Chuzhan Hao, Guochao Jiang, Yuewei Zhang, Hao Wang,
Abstract summary: Speculative decoding accelerates inference in large language models (LLMs). This paper proposes RASD (Retrieval-Augmented Speculative Decoding), which adopts retrieval methods to enhance model-based speculative decoding.
Score: 5.3926068062773895
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Speculative decoding accelerates inference in large language models (LLMs) by generating draft tokens for target model verification. Current approaches for obtaining draft tokens rely on lightweight draft models or additional model structures to generate draft tokens and retrieve context from databases. Due to the draft model's small size and limited training data, model-based speculative decoding frequently becomes less effective in out-of-domain scenarios. Additionally, the time cost of the drafting phase results in a low upper limit on acceptance length during the verification step, limiting overall efficiency. This paper proposes RASD (Retrieval-Augmented Speculative Decoding), which adopts retrieval methods to enhance model-based speculative decoding. We introduce tree pruning and tree fusion to achieve this. Specifically, we develop a pruning method based on the draft model's probability distribution to construct the optimal retrieval tree. Second, we employ the longest prefix matching algorithm to merge the tree generated by the draft model with the retrieval tree, resulting in a unified tree for verification. Experimental results demonstrate that RASD achieves state-of-the-art inference acceleration across tasks such as DocQA, Summary, Code, and In-Domain QA. Moreover, RASD exhibits strong scalability, seamlessly integrating with various speculative decoding approaches, including both generation-based and retrieval-based methods.
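The tree fusion step described in the abstract can be illustrated with a minimal sketch. It assumes both candidate trees are stored as nested-dict tries keyed by token id, and it treats "longest prefix matching" as: walk each retrieval path down the draft tree as far as the tokens agree, then graft the unmatched remainder. The names `insert_path`, `build_trie`, `leaf_paths`, `fuse`, and `count_nodes` are hypothetical and the token ids are made up; this is one plausible reading, not the authors' implementation.

```python
import copy
from typing import Dict, Iterator, List, Tuple

# A trie node maps a token id to its child node (hypothetical representation).
Trie = Dict[int, "Trie"]


def insert_path(trie: Trie, tokens: List[int]) -> int:
    """Insert a token path and return the length of the prefix already present."""
    node = trie
    matched = 0
    for tok in tokens:
        if tok in node:
            # Still inside the shared prefix; once a token is missing, the
            # freshly created child is empty, so no later token can match.
            matched += 1
        node = node.setdefault(tok, {})
    return matched


def build_trie(sequences: List[List[int]]) -> Trie:
    """Build a trie from a list of candidate token sequences."""
    trie: Trie = {}
    for seq in sequences:
        insert_path(trie, seq)
    return trie


def leaf_paths(trie: Trie, prefix: Tuple[int, ...] = ()) -> Iterator[List[int]]:
    """Yield every root-to-leaf token path in the trie."""
    if not trie:
        yield list(prefix)
        return
    for tok, child in trie.items():
        yield from leaf_paths(child, prefix + (tok,))


def fuse(draft_tree: Trie, retrieval_tree: Trie) -> Trie:
    """Merge the retrieval tree into a copy of the draft tree, sharing prefixes."""
    fused = copy.deepcopy(draft_tree)
    for path in leaf_paths(retrieval_tree):
        insert_path(fused, path)
    return fused


def count_nodes(trie: Trie) -> int:
    """Count token nodes, i.e. candidates the target model would have to verify."""
    return sum(1 + count_nodes(child) for child in trie.values())


draft = build_trie([[1, 2, 3], [1, 4]])          # tree from the draft model
retrieval = build_trie([[1, 2, 5, 6], [7, 8]])   # tree from retrieved continuations
fused = fuse(draft, retrieval)
print(count_nodes(draft), count_nodes(retrieval), count_nodes(fused))
```

In this toy case the two trees share the prefix `1, 2`, so the fused tree holds fewer nodes than the two trees kept side by side, which is the motivation for verifying a single unified tree.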

Related papers

Dynamic Delayed Tree Expansion For Improved Multi-Path Speculative Decoding [35.984745508100595]
We present a systematic evaluation of verification strategies across model families, tasks, and sampling regimes. Traversal Verification dominates consistently, with OT-based methods lagging far behind. We propose delayed tree expansion, which drafts a partial single path, delaying the i.i.d. branching point.
arXiv Detail & Related papers (2026-02-19T01:41:58Z)
Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass. HSD gives up to 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z)
CARD: Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference [19.14564724894706]
We propose a speculative decoding framework employing a 'query-and-correct' paradigm. CARD decouples drafting and verification: the draft model generates candidate tokens to populate a shared cache, while the target model concurrently rectifies the draft model's generation direction. Our approach achieves up to 4.83x speedup over vanilla decoding without requiring fine-tuning of either the draft or target models.
arXiv Detail & Related papers (2025-08-06T14:02:10Z)
Traversal Verification for Speculative Tree Decoding [9.534492618180085]
Speculative decoding is a promising approach for accelerating large language models. This paper introduces Traversal Verification, a novel speculative decoding algorithm. We show that our method consistently improves acceptance length and throughput over existing methods.
arXiv Detail & Related papers (2025-05-18T12:51:55Z)
C2T: A Classifier-Based Tree Construction Method in Speculative Decoding [9.663330370149428]
Speculative decoding methods often face inefficiencies in the construction of token trees and the verification of candidate tokens. We propose a novel method named C2T that adopts a lightweight classifier to generate and prune token trees dynamically.
arXiv Detail & Related papers (2025-02-19T11:57:02Z)
COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks. We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges. Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models. We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses. We observe a remarkable speedup of 1.73x to 1.96x, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure [40.9990864658776]
Speculative decoding employs a "draft and then verify" mechanism to allow multiple tokens to be generated in one step. Existing methods mainly adopt fixed draft structures, which fail to adapt to different situations. We propose OPT-Tree, an algorithm to construct adaptive and scalable draft trees.
arXiv Detail & Related papers (2024-06-25T04:45:53Z)
Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement [11.91629418177851]
Speculative decoding is an inference-acceleration method for large language models. Recent works have advanced this method by establishing a draft-token tree. We present Recursive Speculative Decoding (RSD), a novel tree-based method that samples draft tokens without replacement.
arXiv Detail & Related papers (2024-02-21T22:57:49Z)
Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet generating text autoregressively is time-consuming. One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that are then verified in parallel by the target model. This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification. We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
arXiv Detail & Related papers (2024-01-12T17:15:23Z)
DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models. We propose to generate a symbolic and ordered sequence from the relation matrix, which is deterministic and easier for the model to learn. Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
arXiv Detail & Related papers (2022-10-28T11:18:10Z)
Entailment Tree Explanations via Iterative Retrieval-Generation Reasoner [56.08919422452905]
We propose an architecture called Iterative Retrieval-Generation Reasoner (IRGR). Our model is able to explain a given hypothesis by systematically generating a step-by-step explanation from textual premises. We outperform existing benchmarks on premise retrieval and entailment tree generation, with around 300% gain in overall correctness.
arXiv Detail & Related papers (2022-05-18T21:52:11Z)
Complex Event Forecasting with Prediction Suffix Trees: Extended Technical Report [70.7321040534471]
Complex Event Recognition (CER) systems have become popular in the past two decades due to their ability to "instantly" detect patterns on real-time streams of events. There is a lack of methods for forecasting when a pattern might occur before such an occurrence is actually detected by a CER engine. We present a formal framework that attempts to address the issue of Complex Event Forecasting.
arXiv Detail & Related papers (2021-09-01T09:52:31Z)
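
Several of the abstracts listed above refer to a "draft and then verify" mechanism and to acceptance length. The sketch below shows the simplest form of that idea, greedy verification of a single draft path: the target model scores every draft position in one pass, the longest agreeing prefix is kept, and the target's own token at the first disagreement (or after the last draft token) is appended. The function name `accept_greedy` and the token ids are hypothetical, and the target model's outputs are passed in as a plain list rather than computed; the sampling-based and tree-based verification schemes in the listed papers are more involved.

```python
from typing import List


def accept_greedy(draft: List[int], target_next: List[int]) -> List[int]:
    """Return the tokens emitted by one greedy draft-then-verify step.

    target_next[i] is assumed to be the target model's greedy token after the
    context followed by draft[:i], so it has one more entry than the draft.
    """
    if len(target_next) != len(draft) + 1:
        raise ValueError("target_next must have len(draft) + 1 entries")

    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target_next[i]:
            break  # first disagreement: everything after this is discarded
        accepted.append(tok)

    # The target's token at the stopping position is always valid, so each
    # step emits at least one token even when the whole draft is rejected.
    accepted.append(target_next[len(accepted)])
    return accepted


# Draft proposes four tokens; the target agrees with the first two only.
print(accept_greedy([5, 9, 2, 7], [5, 9, 4, 1, 3]))
# Draft fully accepted; the target contributes one bonus token.
print(accept_greedy([5, 9], [5, 9, 8]))
```

The acceptance length per step is the number of tokens returned, which is why better drafts (or larger candidate trees) translate into fewer target-model passes.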
