EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
- URL: http://arxiv.org/abs/2406.16858v2
- Date: Sun, 30 Jun 2024 15:03:25 GMT
- Title: EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
- Authors: Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
- Abstract summary: In this paper, we propose a new technique that introduces a context-aware dynamic draft tree into draft modeling.
We conducted extensive evaluations on three series of Large Language Models (LLMs) and six tasks.
- Score: 25.703729145091483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods, such as EAGLE, use a static draft tree, implicitly assuming that the acceptance rate of draft tokens depends only on their position. Interestingly, we found that the acceptance rate of draft tokens is also context-dependent. In this paper, building upon EAGLE, we propose EAGLE-2, which introduces a context-aware dynamic draft tree into draft modeling. This improvement leverages the fact that the draft model of EAGLE is well-calibrated: the confidence scores from the draft model approximate acceptance rates with small errors. We conducted extensive evaluations on three series of LLMs and six tasks, with EAGLE-2 achieving speedup ratios of 3.05x-4.26x, which is 20%-40% faster than EAGLE-1. EAGLE-2 also ensures that the distribution of the generated text remains unchanged, making it a lossless acceleration algorithm.
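The core mechanism is easy to picture: since the draft model's confidence approximates a token's acceptance rate, the draft tree can be grown greedily toward high-confidence paths. Below is a minimal Python sketch of that idea, not the authors' implementation; `draft_step` is a hypothetical stand-in for the EAGLE draft model, and EAGLE-2's reranking phase is omitted.

```python
import heapq

def expand_draft_tree(draft_step, root_token, beam_width=8, depth=6, branch=4):
    # Hypothetical sketch: `draft_step(prefix)` returns (token, confidence)
    # pairs for the top continuations of a token prefix. A node's value is
    # the product of confidences along its path, which EAGLE-2 treats as an
    # estimate of the probability that the whole path is accepted.
    frontier = [(-1.0, (root_token,))]  # (negated path confidence, prefix)
    tree = []
    for _ in range(depth):
        # Expand only the `beam_width` most promising nodes of this level.
        level = heapq.nsmallest(beam_width, frontier)
        frontier = []
        for neg_conf, prefix in level:
            for token, conf in draft_step(prefix)[:branch]:
                path_conf = -neg_conf * conf
                frontier.append((-path_conf, prefix + (token,)))
                tree.append((prefix + (token,), path_conf))
    return tree  # candidate draft paths, to be verified by the target LLM
```

The dynamic shape falls out naturally: context the draft model finds easy yields deep, narrow paths, while uncertain context spreads the same node budget across siblings.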
Related papers
- FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling [59.8051705468084]
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models.
We present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression.
arXiv Detail & Related papers (2025-02-20T18:58:10Z)
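The vocabulary-compression step lends itself to a short illustration. The sketch below assumes a hypothetical frequency table `freq_rank` (vocabulary ids sorted by corpus frequency); only drafting is restricted, since verification by the target model still covers the full vocabulary.

```python
import numpy as np

def frequency_ranked_draft(hidden, lm_head_weight, freq_rank, k=16_000):
    # Illustrative only: score just the k most frequent tokens, shrinking
    # the draft model's LM-head matmul from O(V*d) to O(k*d).
    keep = freq_rank[:k]                          # high-frequency subset
    sub_logits = hidden @ lm_head_weight[keep].T  # (k,) instead of (V,)
    return keep[int(np.argmax(sub_logits))]       # token id in the full vocab
```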
- Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.
LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.
We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
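The summary leaves the pruning criterion abstract, so the following is only a generic structured-pruning skeleton under an assumed L2-norm importance score; SAAP's actual structural-importance metric and adaptive schedule are more elaborate.

```python
import numpy as np

def structured_prune(weights, keep_ratio=0.7):
    # `weights` maps a structural unit (e.g. an attention head or FFN
    # channel) to its weight matrix. Units are ranked by a simple L2-norm
    # importance score and the lowest-ranked ones are dropped whole.
    scores = {name: float(np.linalg.norm(w)) for name, w in weights.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    kept = set(ranked[: int(len(ranked) * keep_ratio)])
    return {name: w for name, w in weights.items() if name in kept}
```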
- Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration [14.011702040133848]
We propose a CTC-based draft model which strengthens the correlations between draft tokens during the draft phase.
Experiment results show that compared to strong baselines, the proposed method can achieve a higher acceptance rate and hence a faster inference speed.
arXiv Detail & Related papers (2024-11-25T14:10:21Z)
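A CTC draft model can emit all positions in one forward pass and still produce a coherent sequence because of the standard CTC decoding rule: merge consecutive repeats, then drop blanks. A minimal sketch of that rule, with an assumed blank id of 0:

```python
BLANK = 0  # hypothetical id of the CTC blank token

def ctc_collapse(frame_tokens):
    # Merge consecutive repeats, then drop blanks, so one
    # non-autoregressive pass yields a draft token sequence.
    out, prev = [], None
    for t in frame_tokens:
        if t != prev and t != BLANK:
            out.append(t)
        prev = t
    return out

# e.g. ctc_collapse([7, 7, 0, 7, 3, 3, 0]) -> [7, 7, 3]
```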
- Language Model Preference Evaluation with Multiple Weak Evaluators [78.53743237977677]
GED (Preference Graph Ensemble and Denoise) is a novel approach that leverages multiple model-based evaluators to construct preference graphs.
In particular, our method consists of two primary stages: aggregating evaluations into a unified graph and applying a denoising process.
We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure.
arXiv Detail & Related papers (2024-10-14T01:57:25Z)
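The two stages map onto a few lines of code. In the sketch below, `judgements` is a hypothetical pool of (winner, loser) pairs gathered from all weak evaluators; the denoising shown is only a majority vote per pair, whereas GED also removes longer cycles to recover an acyclic preference structure.

```python
from collections import Counter

def ensemble_and_denoise(judgements):
    # Stage 1: aggregate pairwise preferences into one weighted graph.
    votes = Counter(judgements)
    # Stage 2 (toy denoising): keep an edge only when its direction wins
    # the majority vote, which eliminates contradictory 2-cycles.
    edges = set()
    for (a, b), n in votes.items():
        if n > votes[(b, a)]:
            edges.add((a, b))
    return edges
```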
- Dynamic Depth Decoding: Faster Speculative Decoding for LLMs [8.071750249796459]
We introduce Dynamic Depth Decoding (DDD), which optimises Eagle-2's tree drafting method using a dynamic depth.
This extends the average speedup that Eagle-2 achieves over Eagle by 44%, giving DDD an average speedup of 3.16x.
arXiv Detail & Related papers (2024-08-30T03:27:48Z)
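The dynamic-depth idea reduces to a stopping rule on the draft loop. A sketch under assumed names: `draft_level(depth)` stands in for drafting one more tree level and returns the path confidences of its nodes, and the 0.35 threshold is purely illustrative.

```python
def draft_with_dynamic_depth(draft_level, max_depth=10, threshold=0.35):
    # Replace a fixed draft depth with an early stop: once the confidence
    # mass surviving at the deepest level is small, another level is
    # unlikely to contribute an accepted token.
    depth = 0
    while depth < max_depth:
        confidences = draft_level(depth)
        depth += 1
        if sum(confidences) <= threshold:
            break
    return depth
```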
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73x to 1.96x, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
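How a DAG can be cheaper than a tree is easiest to see with a toy merge rule. The sketch below keys nodes by (position, token) so drafts that share a token at the same depth share one node; the paper's actual merge criterion is more careful, so treat this purely as an illustration.

```python
def build_hypothesis_dag(hypotheses):
    # `hypotheses` is a list of drafted token sequences. Merging nodes
    # lets the target model verify each shared token once instead of
    # once per hypothesis.
    nodes, edges = set(), set()
    for seq in hypotheses:
        prev = (-1, "root")
        for pos, tok in enumerate(seq):
            node = (pos, tok)  # toy merge key: same depth, same token
            nodes.add(node)
            edges.add((prev, node))
            prev = node
    return nodes, edges
```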
- Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting [68.90949377014742]
Speculative RAG is a framework that leverages a larger generalist LM to efficiently verify multiple RAG drafts produced in parallel by a smaller, distilled specialist LM.
Our method accelerates RAG by delegating drafting to the smaller specialist LM, with the larger generalist LM performing a single verification pass over the drafts.
It notably enhances accuracy by up to 12.97% while reducing latency by 51% compared to conventional RAG systems on PubHealth.
arXiv Detail & Related papers (2024-07-11T06:50:19Z)
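The control flow is simple enough to sketch. `specialist` and `generalist` below are hypothetical callables for the small distilled drafter and the large verifier; each retrieved-document subset yields one draft, and the generalist performs a single scoring pass instead of generating from scratch.

```python
def speculative_rag(question, doc_subsets, specialist, generalist):
    # One draft per document subset (the paper produces these in
    # parallel; they are shown sequentially here for clarity).
    drafts = [specialist(question, docs) for docs in doc_subsets]
    # Single verification pass: the generalist scores drafts, not documents.
    scores = [generalist(question, draft) for draft in drafts]
    return max(zip(scores, drafts))[1]  # return the best-scored draft
```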
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty [28.07947754770082]
Autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level.
The inherent uncertainty in feature (second-to-top-layer) level autoregression constrains its performance.
arXiv Detail & Related papers (2024-01-26T18:59:01Z)
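The "feature uncertainty" point is concrete: the second-to-top-layer feature alone does not reveal which token the sampling step actually chose, so EAGLE also feeds the sampled token into the draft head. A sketch with hypothetical stand-in weights:

```python
import numpy as np

def eagle_draft_step(feature, token_embedding, w_draft, lm_head):
    # Autoregress at the feature level: predict the next second-to-top
    # feature from the current feature plus the embedding of the sampled
    # token, which resolves the sampling ambiguity.
    x = np.concatenate([feature, token_embedding])
    next_feature = np.tanh(w_draft @ x)  # one-layer draft head (stand-in)
    logits = lm_head @ next_feature      # frozen LM head of the target model
    return next_feature, logits
```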
- ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and Effective Text Generation [97.64625999380425]
We study the text generation task using pre-trained language models (PLMs).
By leveraging the early exit technique, ELMER generates tokens at different layers according to their prediction confidence.
Experiments on three text generation tasks show that ELMER significantly outperforms NAR models.
arXiv Detail & Related papers (2022-10-24T14:46:47Z)
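Early exit is the load-bearing mechanism here, and a toy version fits in a few lines. `layer_outputs` below is an assumed list of per-layer hidden states for one position, and the 0.9 threshold is illustrative:

```python
import numpy as np

def early_exit_token(layer_outputs, lm_head, threshold=0.9):
    # Emit the token at the first layer whose prediction confidence
    # clears the threshold, instead of always decoding at the top layer.
    for depth, hidden in enumerate(layer_outputs):
        logits = lm_head @ hidden
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                    # stable softmax
        if probs.max() >= threshold:
            return int(probs.argmax()), depth   # early exit
    return int(probs.argmax()), depth           # fall back to the top layer
```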