Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement
- URL: http://arxiv.org/abs/2402.14160v2
- Date: Tue, 5 Mar 2024 06:55:26 GMT
- Title: Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement
- Authors: Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott
- Abstract summary: Speculative decoding is an inference-acceleration method for large language models.
Recent works have advanced this method by establishing a draft-token tree.
We present Recursive Speculative Decoding (RSD), a novel tree-based method that samples draft tokens without replacement.
- Score: 11.91629418177851
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speculative decoding is an inference-acceleration method for large language models (LLMs) where a small language model generates a draft-token sequence which is then verified by the target LLM in parallel. Recent works have advanced this method by establishing a draft-token tree, achieving superior performance over single-sequence speculative decoding. However, those works generate tokens independently at each level of the tree and thus do not exploit the full diversity the tree can represent. Moreover, their empirical superiority has been shown for a fixed draft-sequence length, which implicitly grants the tree-based methods more computational resources at the target LLM. None of the existing works has conducted empirical studies under a fixed target computational budget, despite the importance of such budgets for resource-bounded devices. We present Recursive Speculative Decoding (RSD), a novel tree-based method that samples draft tokens without replacement and maximizes the diversity of the tree. During RSD's drafting, the tree is built with either the Gumbel-Top-$k$ trick, which draws tokens without replacement in parallel, or Stochastic Beam Search, which samples sequences without replacement while early-truncating unlikely draft sequences, reducing the computational cost at the target LLM. We empirically evaluate RSD with Llama 2 and OPT models, showing that RSD outperforms the baseline methods: consistently for a fixed draft-sequence length, and in most cases for a fixed computational budget at the target LLM.
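The Gumbel-Top-$k$ trick referenced in the abstract admits a compact illustration: perturb each token's log-probability with i.i.d. Gumbel(0, 1) noise and keep the $k$ largest perturbed values; the resulting $k$ distinct tokens are distributed identically to sampling sequentially without replacement. (Stochastic Beam Search applies the same perturbation idea recursively over sequences.) Below is a minimal NumPy sketch of the trick itself, not of the authors' full drafting procedure; the function name and toy distribution are illustrative.

```python
import numpy as np

def gumbel_top_k(log_probs: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Draw k distinct token indices without replacement.

    Adding i.i.d. Gumbel(0, 1) noise to each log-probability and keeping
    the k largest perturbed values matches the distribution of sampling
    k tokens one by one without replacement.
    """
    perturbed = log_probs + rng.gumbel(size=log_probs.shape)
    top_k = np.argpartition(-perturbed, k)[:k]       # find top-k in O(V)
    return top_k[np.argsort(-perturbed[top_k])]      # order by perturbed value

# Toy usage: draw 3 distinct "tokens" from a 10-way distribution.
rng = np.random.default_rng(0)
logits = rng.normal(size=10)
log_probs = logits - np.log(np.exp(logits).sum())   # log-softmax (unnormalized logits also work)
print(gumbel_top_k(log_probs, k=3, rng=rng))
```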
Related papers
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation [8.046705062670096]
Lossless speculative decoding accelerates target large language model inference.
We propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding) to boost speculative decoding.
Our experiments include both greedy and non-greedy decoding on the largest and smallest models from the Vicuna and LLaMA3-Instruct series.
arXiv Detail & Related papers (2024-08-28T06:28:01Z) - Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling [53.58854856174773]
Speculative decoding is an approach to accelerate inference through a guess-and-verify paradigm.
Token Recycling stores candidate tokens in an adjacency matrix and employs a breadth-first search algorithm to assemble a draft tree (a rough sketch of this retrieval step appears after this list).
It significantly outperforms existing training-free methods by 30% and even a training-based method by 25%.
arXiv Detail & Related papers (2024-08-16T12:20:56Z)
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73$\times$ to 1.96$\times$, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
- LiteSearch: Efficacious Tree Search for LLM [70.29796112457662]
This study introduces a novel guided tree search algorithm with dynamic node selection and node-level exploration budget.
Experiments conducted on the GSM8K and TabMWP datasets demonstrate that our approach enjoys significantly lower computational costs compared to baseline methods.
arXiv Detail & Related papers (2024-06-29T05:14:04Z)
- Latent Logic Tree Extraction for Event Sequence Explanation from LLMs [19.90330712436838]
Modern high-stakes systems, such as healthcare or robotics, often generate vast streaming event sequences.
Our goal is to design an efficient, plug-and-play tool to elicit logic tree-based explanations from Large Language Models (LLMs) to provide customized insights into each observed event sequence.
In the online setting, our locally built, lightweight model extracts the most relevant rules from LLMs for each sequence in only a few iterations.
arXiv Detail & Related papers (2024-06-03T09:10:42Z)
- ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding [12.449023969197684]
ProPD is an efficient parallel decoding framework based on dynamic token tree pruning and generation.
We demonstrate that ProPD consistently outperforms existing decoding algorithms by 1.1$\times$ to 3.2$\times$.
arXiv Detail & Related papers (2024-02-21T02:51:07Z)
- Tree-Planner: Efficient Close-loop Task Planning with Large Language Models [63.06270302774049]
Tree-Planner reframes task planning with Large Language Models into three distinct phases.
Tree-Planner achieves state-of-the-art performance while maintaining high efficiency.
arXiv Detail & Related papers (2023-10-12T17:59:50Z)
- SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification [13.174386920965107]
SpecInfer is a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification.
The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism.
arXiv Detail & Related papers (2023-05-16T20:12:59Z)
- Recursive Top-Down Production for Sentence Generation with Latent Trees [77.56794870399288]
We model the production property of context-free grammars for natural and synthetic languages.
We present a dynamic programming algorithm that marginalises over latent binary tree structures with $N$ leaves.
We also present experimental results on German-English translation on the Multi30k dataset.
arXiv Detail & Related papers (2020-10-09T17:47:16Z)
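As a rough illustration of the retrieval step described in the Token Recycling entry above: candidate successor tokens are kept per token (the rows of that paper's adjacency matrix), and a breadth-first search expands them into a small draft tree for parallel verification. This is a minimal sketch under those assumptions, not the paper's implementation; the dictionary-based storage, function name, and toy data are illustrative.

```python
from collections import deque

def build_draft_tree(adj: dict[int, list[int]], root: int, max_depth: int) -> list[tuple[int, int]]:
    """Expand per-token candidate successors into a draft tree via BFS.

    Returns (token_id, depth) pairs in breadth-first order; a real
    system would also track parent links to form verification paths.
    """
    tree, queue = [], deque([(root, 0)])
    while queue:
        token, depth = queue.popleft()
        tree.append((token, depth))
        if depth < max_depth:
            for successor in adj.get(token, []):
                queue.append((successor, depth + 1))
    return tree

# Toy usage: token 7 was recently followed by 3 or 9, token 9 by 4 or 5.
adj = {7: [3, 9], 3: [1], 9: [4, 5]}
print(build_draft_tree(adj, root=7, max_depth=2))  # [(7,0), (3,1), (9,1), (1,2), (4,2), (5,2)]
```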