EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
- URL: http://arxiv.org/abs/2406.16858v2
- Date: Sun, 30 Jun 2024 15:03:25 GMT
- Title: EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
- Authors: Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
- Abstract summary: In this paper, we propose EAGLE-2, which introduces a context-aware dynamic draft tree into the drafting stage.
We conducted extensive evaluations on three series of Large Language Models (LLMs) and six tasks.
- Score: 25.703729145091483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods such as EAGLE use a static draft tree, implicitly assuming that the acceptance rate of draft tokens depends only on their position. Interestingly, we found that the acceptance rate of draft tokens is also context-dependent. In this paper, building upon EAGLE, we propose EAGLE-2, which introduces a new technique of context-aware dynamic draft tree into drafting modeling. This improvement leverages the fact that the draft model of EAGLE is well-calibrated: the confidence scores from the draft model approximate acceptance rates with small errors. We conducted extensive evaluations on three series of LLMs and six tasks, with EAGLE-2 achieving speedup ratios 3.05x-4.26x, which is 20%-40% faster than EAGLE-1. EAGLE-2 also ensures that the distribution of the generated text remains unchanged, making it a lossless acceleration algorithm.
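The mechanism the abstract describes lends itself to a short sketch. Below is a minimal, illustrative Python sketch of confidence-guided draft-tree expansion: because the draft model is well-calibrated, each candidate's cumulative draft confidence serves as a proxy for its acceptance rate, so the tree is grown greedily from the most promising nodes and the candidates are reranked before verification. The `draft_step` interface, branching factor, and budget are illustrative assumptions, not the authors' implementation.

```python
import heapq
import math
from itertools import count

def expand_draft_tree(draft_step, root, budget=32, top_k=4):
    """Grow a context-aware draft tree: always expand the node whose
    cumulative draft confidence (a calibrated proxy for the acceptance
    rate) is highest, instead of using a fixed static tree shape.

    draft_step(tokens) -> list of (token, prob): assumed draft-model API.
    """
    tie = count()                            # tiebreaker for equal scores
    heap = [(0.0, next(tie), list(root))]    # (-cum_logprob, id, tokens)
    nodes = []
    while heap and len(nodes) < budget:
        neg_lp, _, tokens = heapq.heappop(heap)
        for tok, p in draft_step(tokens)[:top_k]:
            child = tokens + [tok]
            lp = -neg_lp + math.log(p)       # cumulative log-confidence
            nodes.append((child, lp))
            heapq.heappush(heap, (-lp, next(tie), child))
    # Rerank: spend the verification budget on the most confident nodes,
    # whatever their depth, mirroring the paper's two-phase expand/rerank.
    nodes.sort(key=lambda n: n[1], reverse=True)
    return nodes[:budget]
```

The effect is that deep branches are only explored in contexts where the draft model stays confident, which is exactly the context dependence a static tree cannot capture.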
Related papers
- RASD: Retrieval-Augmented Speculative Decoding [5.3926068062773895]
Speculative decoding accelerates inference in large language models (LLMs).
This paper proposes RASD (Retrieval-Augmented Speculative Decoding), which adopts retrieval methods to enhance model-based speculative decoding.
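As a rough illustration of the retrieval side, the toy Python sketch below proposes draft tokens by matching the longest suffix of the current context against a token datastore; the datastore format and matching rule are assumptions, not RASD's actual retrieval method.

```python
def retrieval_draft(context, datastore, max_len=8):
    """Toy retrieval drafter: find the longest suffix of `context` that
    occurs in `datastore` (a flat token list) and propose the tokens that
    followed it there. A stand-in for a retrieval-augmented draft source."""
    for k in range(min(len(context), 16), 0, -1):   # longest suffix first
        suffix = context[-k:]
        for i in range(len(datastore) - k):
            if datastore[i:i + k] == suffix:
                return datastore[i + k:i + k + max_len]  # retrieved draft
    return []  # fall back to the model-based draft when nothing matches
```

A real system would merge such retrieved continuations with the model-based draft before verification.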
arXiv Detail & Related papers (2025-03-05T12:10:14Z)
- EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test [25.703729145091483]
A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs.
EAGLE-3 abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion.
These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data.
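A minimal PyTorch sketch of the two ideas the summary names, multi-layer feature fusion and direct token prediction, is shown below. The layer choice, linear fusion, and head are assumptions for illustration, not EAGLE-3's exact architecture.

```python
import torch
import torch.nn as nn

class FusedDraftHead(nn.Module):
    """Concatenate hidden states taken from several target-model layers
    (e.g. low, middle, high), project them to one vector, and predict the
    next token directly, with no intermediate feature regression."""
    def __init__(self, hidden, vocab, n_layers_fused=3):
        super().__init__()
        self.fuse = nn.Linear(n_layers_fused * hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab, bias=False)

    def forward(self, layer_states):  # list of [batch, seq, hidden]
        fused = self.fuse(torch.cat(layer_states, dim=-1))
        return self.lm_head(fused)    # direct next-token logits
```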
arXiv Detail & Related papers (2025-03-03T18:59:04Z)
- CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter [9.631036588583248]
Speculative decoding is a powerful technique that accelerates Large Language Model (LLM) inference by leveraging a lightweight speculative draft model.
Recent methods have tried to improve the draft model by adopting a multi-step training strategy, but the complex inputs of the different training steps make it harder for the draft model to converge.
We propose CORAL, a novel framework that improves both accuracy and efficiency in speculative drafting.
arXiv Detail & Related papers (2025-02-24T06:28:26Z) - FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling [59.8051705468084]
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models.
We present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression.
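The compression step can be sketched in a few lines: the draft model scores only a frequency-ranked slice of the vocabulary, shrinking the LM-head matmul and softmax. The function below is a hedged illustration; the subset size and how FR-Spec integrates it into drafting are not reproduced here.

```python
import torch

def compressed_logits(hidden, lm_head_weight, freq_ranked_ids):
    """Score only a frequency-ranked subset of the vocabulary.

    hidden:          [batch, hidden] final draft-model states
    lm_head_weight:  [vocab, hidden] full output embedding matrix
    freq_ranked_ids: LongTensor of the top-k most frequent token ids,
                     precomputed from corpus statistics (assumed input)
    Returns logits over the subset plus the ids, so sampled tokens can be
    mapped back to the full vocabulary.
    """
    sub_weight = lm_head_weight[freq_ranked_ids]   # [k, hidden]
    logits = hidden @ sub_weight.T                 # [batch, k]
    return logits, freq_ranked_ids
```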
arXiv Detail & Related papers (2025-02-20T18:58:10Z)
- Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.
LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.
We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
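For orientation, structured pruning of this kind can be sketched generically: score whole structures (here, FFN hidden channels) by an importance measure and keep only the top fraction. The plain weight-norm score below is a simplified stand-in; SAAP's structural-importance metric is more elaborate.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prune_ffn_channels(linear_in: nn.Linear, linear_out: nn.Linear, keep_ratio=0.7):
    """Structured pruning of one FFN block: score each hidden channel by
    the L2 norm of its weights and keep the top `keep_ratio` fraction."""
    score = linear_in.weight.norm(dim=1) + linear_out.weight.norm(dim=0)
    k = max(1, int(keep_ratio * score.numel()))
    keep = torch.topk(score, k).indices.sort().values
    new_in = nn.Linear(linear_in.in_features, k, bias=linear_in.bias is not None)
    new_out = nn.Linear(k, linear_out.out_features, bias=linear_out.bias is not None)
    new_in.weight.copy_(linear_in.weight[keep])
    new_out.weight.copy_(linear_out.weight[:, keep])
    if linear_in.bias is not None:
        new_in.bias.copy_(linear_in.bias[keep])
    if linear_out.bias is not None:
        new_out.bias.copy_(linear_out.bias)
    return new_in, new_out
```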
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
- AIC CTU system at AVeriTeC: Re-framing automated fact-checking as a simple RAG task [0.0]
This paper describes our solution to the challenge of fact-checking with evidence retrieved in the wild, using a simple scheme of Retrieval-Augmented Generation (RAG).
We release our system and explain its two modules - the Retriever and the Evidence & Label generator - in detail, justifying their features such as MMR-reranking and Likert-scale confidence estimation.
An empirical error analysis shows that faults in our predictions often coincide with noise in the data or ambiguous fact-checks, motivating further research and data augmentation.
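MMR-reranking itself is a standard technique, so a compact NumPy sketch can make the Retriever feature concrete: each pick balances relevance to the query against redundancy with documents already selected. The lambda value and embedding setup below are illustrative, not the system's configuration.

```python
import numpy as np

def mmr_rerank(query_vec, doc_vecs, lam=0.7, k=5):
    """Maximal Marginal Relevance: iteratively pick the document most
    relevant to the query yet least similar to those already selected.
    Vectors are assumed L2-normalized, so dot product = cosine similarity."""
    relevance = doc_vecs @ query_vec
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max((doc_vecs[i] @ doc_vecs[j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```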
arXiv Detail & Related papers (2024-10-15T09:50:19Z)
- Language Model Preference Evaluation with Multiple Weak Evaluators [78.53743237977677]
GED (Preference Graph Ensemble and Denoise) is a novel approach that leverages multiple model-based evaluators to construct preference graphs.
In particular, our method consists of two primary stages: aggregating evaluations into a unified graph and applying a denoising process.
We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure.
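The two stages can be sketched on a toy scale: merge each evaluator's pairwise judgments into one weighted digraph, then remove edges until the graph is acyclic. The denoising rule below (drop the weakest edge on each cycle) is a simple stand-in for GED's more principled procedure.

```python
from collections import defaultdict

def aggregate(evaluator_judgments):
    """Stage 1: merge pairwise judgments from several weak evaluators into
    one weighted digraph; g[a][b] counts how often a beat b."""
    g = defaultdict(lambda: defaultdict(int))
    for judgments in evaluator_judgments:
        for winner, loser in judgments:
            g[winner][loser] += 1
    return g

def find_cycle(g):
    """Return one directed cycle as a list of edges, or None."""
    color, stack = defaultdict(int), []   # 0=white, 1=gray, 2=black
    def dfs(u):
        color[u] = 1
        stack.append(u)
        for v in list(g[u]):
            if color[v] == 1:             # back edge closes a cycle
                cyc = stack[stack.index(v):] + [v]
                return list(zip(cyc, cyc[1:]))
            if color[v] == 0:
                found = dfs(v)
                if found:
                    return found
        stack.pop()
        color[u] = 2
        return None
    for node in list(g):
        if color[node] == 0:
            cyc = dfs(node)
            if cyc:
                return cyc
    return None

def denoise(g):
    """Stage 2 (simplified): drop the weakest edge on every cycle until the
    graph is a DAG; a topological order then yields the final ranking."""
    while (cyc := find_cycle(g)):
        u, v = min(cyc, key=lambda e: g[e[0]][e[1]])
        del g[u][v]
    return g
```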
arXiv Detail & Related papers (2024-10-14T01:57:25Z)
- Dynamic Depth Decoding: Faster Speculative Decoding for LLMs [8.071750249796459]
We introduce Dynamic Depth Decoding (DDD), which optimises Eagle-2's tree drafting method using a dynamic depth.
This extends the average speedup that Eagle-2 achieves over Eagle by 44%, giving DDD an average speedup of 3.16x.
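The dynamic-depth idea reduces to a simple loop: keep drafting deeper only while the draft remains likely to be accepted. The sketch below uses cumulative confidence against a fixed threshold as the stopping rule, which is an illustrative assumption rather than DDD's exact criterion.

```python
def draft_with_dynamic_depth(draft_step, context, max_depth=10, threshold=0.2):
    """Extend the draft while cumulative confidence stays above a
    threshold, instead of always drafting to a fixed depth.

    draft_step(tokens) -> (token, prob): assumed greedy draft-model API.
    """
    tokens, confidence = list(context), 1.0
    drafted = []
    for _ in range(max_depth):
        tok, p = draft_step(tokens)
        confidence *= p
        if confidence < threshold:   # unlikely to be accepted: stop early
            break
        drafted.append(tok)
        tokens.append(tok)
    return drafted
```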
arXiv Detail & Related papers (2024-08-30T03:27:48Z)
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73x to 1.96x, significantly surpassing standard speculative decoding.
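The payoff of a graph over independent hypotheses is that tokens shared by several drafts are verified once. The toy sketch below merges drafted sequences into a prefix-sharing trie, a simple stand-in for the paper's more general DAG.

```python
def merge_hypotheses(hypotheses):
    """Merge drafted token sequences into a prefix-sharing structure so
    that tokens common to several hypotheses occupy one node each."""
    root = {}
    for seq in hypotheses:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})  # shared prefixes collapse
    return root

def count_nodes(node):
    """Number of verification slots needed after merging."""
    return sum(1 + count_nodes(child) for child in node.values())

# Example: three 4-token drafts, 12 tokens total, collapse to 7 nodes.
drafts = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 7]]
print(count_nodes(merge_hypotheses(drafts)))  # -> 7
```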
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
- Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting [68.90949377014742]
Speculative RAG is a framework that leverages a larger generalist LM to efficiently verify multiple RAG drafts produced in parallel by a smaller, distilled specialist LM.
Our method accelerates RAG by delegating drafting to the smaller specialist LM, with the larger generalist LM performing a single verification pass over the drafts.
It notably enhances accuracy by up to 12.97% while reducing latency by 51% compared to conventional RAG systems on PubHealth.
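The control flow described in the summary is short enough to sketch directly. `draft_lm` and `verify_lm` below are assumed callables, and scoring drafts by the verifier's log-likelihood is an illustrative simplification of the verification pass.

```python
def speculative_rag(question, doc_subsets, draft_lm, verify_lm):
    """Sketch of the Speculative RAG control flow: a small specialist LM
    drafts one answer per retrieved-document subset, and the large
    generalist LM scores the drafts in a single verification pass."""
    drafts = [draft_lm(question, docs) for docs in doc_subsets]   # parallelizable
    scores = [verify_lm.logprob(question, d) for d in drafts]     # verify once each
    best = max(range(len(drafts)), key=scores.__getitem__)
    return drafts[best]
```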
arXiv Detail & Related papers (2024-07-11T06:50:19Z)
- GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding [81.01996600734616]
We introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding.
GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM.
We will release our code, data, and the trained draft models.
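The architectural idea in GliDe, a draft model that reuses what the target LLM has already computed, can be sketched as a block that cross-attends into the target's cached states. Dimensions and wiring below are assumptions; a real implementation would inject the cached keys and values directly rather than reprojecting hidden states.

```python
import torch
import torch.nn as nn

class GliDeStyleBlock(nn.Module):
    """Draft block with self-attention over draft tokens plus
    cross-attention into states the target LLM has already cached."""
    def __init__(self, hidden, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

    def forward(self, x, target_states):
        # target_states: hidden states the target LLM already computed;
        # reusing them spares the draft model that work.
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        x = x + self.cross_attn(x, target_states, target_states, need_weights=False)[0]
        return x + self.ffn(x)
```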
arXiv Detail & Related papers (2024-02-03T08:44:11Z)
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty [28.07947754770082]
Autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level.
The inherent uncertainty in feature (second-to-top-layer) level autoregression constrains its performance.
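A sketch helps connect the two sentences above: the drafter autoregresses over features, and the uncertainty introduced by sampling is resolved by also feeding in the token that was actually sampled. Sizes and the single-layer predictor below are assumptions, not EAGLE's exact architecture.

```python
import torch
import torch.nn as nn

class FeatureDrafter(nn.Module):
    """Feature-level autoregression: predict the target model's next
    second-to-top-layer feature from the current feature plus the
    embedding of the token just sampled, then decode the predicted
    feature with the (frozen) LM head to get draft logits."""
    def __init__(self, hidden, vocab):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.predict = nn.Linear(2 * hidden, hidden)         # one-step feature AR
        self.lm_head = nn.Linear(hidden, vocab, bias=False)  # shared with target

    def forward(self, feature, last_token):
        nxt = self.predict(torch.cat([feature, self.embed(last_token)], dim=-1))
        return nxt, self.lm_head(nxt)   # next feature and draft logits
```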
arXiv Detail & Related papers (2024-01-26T18:59:01Z)
- BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs' Generation [60.77990074569754]
We present a computation-efficient framework that steers a frozen Pre-Trained Language Model towards more commonsensical generation.
Specifically, we first construct a reference-free evaluator that assigns a sentence with a commonsensical score.
We then use the scorer as the oracle for commonsense knowledge, and extend the controllable generation method called NADO to train an auxiliary head.
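As a greatly simplified illustration of "scorer as oracle": sample candidates from the frozen LM and keep the most commonsensical one. BOOST itself goes further, using the scorer to train a NADO auxiliary head that steers decoding token by token; that training loop is not reproduced here, and all names below are hypothetical.

```python
def boost_style_rerank(prompt, generate, commonsense_score, n_candidates=8):
    """Use a reference-free commonsense scorer to pick among samples from
    a frozen LM. `generate` and `commonsense_score` are assumed callables."""
    candidates = [generate(prompt) for _ in range(n_candidates)]
    return max(candidates, key=commonsense_score)
```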
arXiv Detail & Related papers (2023-10-25T23:32:12Z)
- ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and Effective Text Generation [97.64625999380425]
We study the text generation task using pre-trained language models (PLMs).
By leveraging the early exit technique, ELMER enables token generation at different layers, according to each token's prediction confidence.
Experiments on three text generation tasks show that ELMER significantly outperforms NAR models.
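The early-exit idea can be sketched as follows: after each layer, decode the hidden states and let positions whose top-token probability is already high "exit" at that layer. The fixed threshold and the fact that exited positions keep flowing through the stack (rather than being skipped) are simplifications for illustration.

```python
import torch

@torch.no_grad()
def early_exit_generate(layers, lm_head, x, threshold=0.9):
    """Confidence-based early exit over a list of transformer layers
    (each maps [batch, seq, hidden] to the same shape)."""
    exited = None  # per-position token ids, fixed once confident
    for layer in layers:
        x = layer(x)
        probs = lm_head(x).softmax(dim=-1)      # [batch, seq, vocab]
        conf, toks = probs.max(dim=-1)
        if exited is None:
            exited = torch.where(conf > threshold, toks, torch.full_like(toks, -1))
        else:
            newly = (exited == -1) & (conf > threshold)
            exited = torch.where(newly, toks, exited)
    # positions never confident enough take the final layer's prediction
    return torch.where(exited == -1, toks, exited)
```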
arXiv Detail & Related papers (2022-10-24T14:46:47Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline that adds no extra computational cost.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
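The re-scaling step described above amounts to normalizing weight-vector lengths. The sketch below matches each predicted novel class weight's L2 norm to the average norm of the pretrained base class weights; using the mean base norm as the target is an assumption.

```python
import torch

@torch.no_grad()
def adaptive_length_rescale(novel_w, base_w):
    """Rescale each predicted novel class weight vector ([n_novel, d]) so
    its L2 norm matches the average norm of the pretrained base class
    weights ([n_base, d]), removing the length mismatch between them."""
    target = base_w.norm(dim=1).mean()                   # average base length
    return novel_w * (target / novel_w.norm(dim=1, keepdim=True))
```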
arXiv Detail & Related papers (2022-03-23T06:24:31Z)