AdaEAGLE: Optimizing Speculative Decoding via Explicit Modeling of Adaptive Draft Structures
- URL: http://arxiv.org/abs/2412.18910v1
- Date: Wed, 25 Dec 2024 13:57:33 GMT
- Title: AdaEAGLE: Optimizing Speculative Decoding via Explicit Modeling of Adaptive Draft Structures
- Authors: Situo Zhang, Hankun Wang, Da Ma, Zichen Zhu, Lu Chen, Kunyao Lan, Kai Yu,
- Abstract summary: We introduce AdaEAGLE, the first SD framework that explicitly models adaptive draft structures.<n>AdaEAGLE achieves a $1.62times$ speedup over the vanilla AR decoding and outperforms fixed-length SotA baseline.
- Score: 11.436315332919245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speculative Decoding (SD) is a popular lossless technique for accelerating the inference of Large Language Models (LLMs). We show that the decoding speed of SD frameworks with static draft structures can be significantly improved by incorporating context-aware adaptive draft structures. However, current studies on adaptive draft structures are limited by their performance, modeling approaches, and applicability. In this paper, we introduce AdaEAGLE, the first SD framework that explicitly models adaptive draft structures. AdaEAGLE leverages the Lightweight Draft Length Predictor (LDLP) module to explicitly predict the optimal number of draft tokens during inference to guide the draft model. It achieves comparable speedup results without manual thresholds and allows for deeper, more specialized optimizations. Moreover, together with threshold-based strategies, AdaEAGLE achieves a $1.62\times$ speedup over the vanilla AR decoding and outperforms fixed-length SotA baseline while maintaining output quality.
Related papers
- Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation [75.58269386927076]
Autoregressive (AR) models are often dismissed as impractical due to prohibitive computational cost.<n>This work re-thinks this paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation.<n> Experiments on diverse datasets (natural, satellite, medical) validate that our method achieves new state-of-the-art compression.
arXiv Detail & Related papers (2025-11-14T06:27:58Z) - Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass.<n>HSD gives up to 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z) - CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling [60.55856973678002]
Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi-step reasoning.<n>Existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs.<n>We propose textbfCALM, a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks.
arXiv Detail & Related papers (2025-10-05T13:38:31Z) - FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning [11.68914161151634]
Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models.<n>We propose a speculative decoding framework that adjusts the drafting and verification strategy according to real-time levels.<n>We show that the proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency.
arXiv Detail & Related papers (2025-09-26T02:48:41Z) - A Novel Self-Evolution Framework for Large Language Models [18.62332474172811]
We propose a novel Dual-Phase Self-Evolution framework to jointly optimize user preference adaptation and domain-specific competence.<n>Experiments across general NLP benchmarks and long-term dialogue tasks demonstrate that DPSE consistently outperforms Supervised Fine-Tuning, Preference Optimization, and Memory-Augmented baselines.
arXiv Detail & Related papers (2025-07-21T06:30:39Z) - Mamba Drafters for Speculative Decoding [58.080550222549064]
We introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM)<n>By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods.<n>We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates.
arXiv Detail & Related papers (2025-06-01T22:52:47Z) - Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models [50.19188692497892]
Traditional alignment methods often require retraining large pretrained models.<n>We propose a novel textitResidual Alignment Model (textitRAM) that formalizes the alignment process as a type of importance sampling.<n>We develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods.
arXiv Detail & Related papers (2025-05-26T08:53:02Z) - Scalable Language Models with Posterior Inference of Latent Thought Vectors [52.63299874322121]
Latent-Thought Language Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space.
LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space.
LTMs significantly outperform conventional autoregressive models and discrete diffusion models in validation perplexity and zero-shot language modeling.
arXiv Detail & Related papers (2025-02-03T17:50:34Z) - Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs)
RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness.
RSD delivers significant efficiency gains against decoding with the target model only, while achieving significant better accuracy than parallel decoding method on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and.
Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting.
LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - Improving Multi-candidate Speculative Decoding [1.6291177798903276]
Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs)<n>In this work, we introduce a new version of MCSD that includes a target model multi-candidate generation.<n>We also evaluate the effects of using the target model multi-candidate process with different draft models on output quality.
arXiv Detail & Related papers (2024-09-16T18:20:38Z) - Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation [8.046705062670096]
Lossless speculative decoding accelerates target large language model inference.
We propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding) to boost speculative decoding.
Our experiments include both greedy and non-greedy decoding on the largest and smallest models from the Vicuna and LLaMA3-Instruct series.
arXiv Detail & Related papers (2024-08-28T06:28:01Z) - Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73$times$ to 1.96$times$, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z) - Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated.
We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z) - Accelerating Production LLMs with Combined Token/Embedding Speculators [4.649953910785797]
This report describes the design and training of novel speculative decoding draft models.
By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams.
arXiv Detail & Related papers (2024-04-29T21:59:07Z) - Large Language Model Agent as a Mechanical Designer [7.136205674624813]
We propose a framework that leverages a pretrained Large Language Model (LLM) in conjunction with an FEM module to autonomously generate, evaluate, and refine structural designs.
LLM operates without domain-specific fine-tuning, using general reasoning to propose design candidates, interpret FEM-derived performance metrics, and apply structurally sound modifications.
Compared to Non- Sorting Genetic Algorithm II (NSGA-II), our method achieves faster convergence and fewer FEM evaluations.
arXiv Detail & Related papers (2024-04-26T16:41:24Z) - GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative
Decoding [81.01996600734616]
We introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding.
GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM.
We will release our code, data, and the trained draft models.
arXiv Detail & Related papers (2024-02-03T08:44:11Z) - DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens.
We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD.
We show that DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks.
arXiv Detail & Related papers (2023-10-12T16:21:04Z) - Bayesian Prompt Learning for Image-Language Model Generalization [64.50204877434878]
We use the regularization ability of Bayesian methods to frame prompt learning as a variational inference problem.
Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts.
We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space.
arXiv Detail & Related papers (2022-10-05T17:05:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.