Dynamic Depth Decoding: Faster Speculative Decoding for LLMs
- URL: http://arxiv.org/abs/2409.00142v1
- Date: Fri, 30 Aug 2024 03:27:48 GMT
- Title: Dynamic Depth Decoding: Faster Speculative Decoding for LLMs
- Authors: Oscar Brown, Zhengjie Wang, Andrea Do, Nikhil Mathew, Cheng Yu,
- Abstract summary: We introduce Dynamic Depth Decoding (DDD), which optimises EAGLE-2's tree drafting method using a dynamic depth.
This extends the average speedup that EAGLE-2 achieves over EAGLE by $44\%$, giving DDD an average speedup of $3.16$x.
- Score: 8.071750249796459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The acceleration of Large Language Models (LLMs) with speculative decoding provides a significant runtime improvement without any loss of accuracy. Currently, EAGLE-2 is the state-of-the-art speculative decoding method, improving on EAGLE with a dynamic draft tree. We introduce Dynamic Depth Decoding (DDD), which optimises EAGLE-2's tree drafting method using a dynamic depth. This extends the average speedup that EAGLE-2 achieves over EAGLE by $44\%$, giving DDD an average speedup of $3.16$x.
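One way to read the $44\%$ figure, offered as a sketch rather than the authors' own formulation: write $S_{\mathrm{E}}$, $S_{\mathrm{E2}}$ and $S_{\mathrm{DDD}}$ for the average speedups of EAGLE, EAGLE-2 and DDD against a common baseline (presumably standard autoregressive decoding). These symbols are introduced here for illustration and do not appear in the abstract. Under that assumed reading, the $44\%$ applies to the gap over EAGLE, not to the total speedup:

```latex
S_{\mathrm{DDD}} - S_{\mathrm{E}} \;\approx\; 1.44\,\bigl(S_{\mathrm{E2}} - S_{\mathrm{E}}\bigr),
\qquad
S_{\mathrm{DDD}} \approx 3.16
```

The abstract does not state $S_{\mathrm{E}}$ or $S_{\mathrm{E2}}$, so the relation cannot be evaluated numerically from this page alone.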
Related papers
- Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree [7.438117410146904]
Falcon is an innovative speculative decoding framework fashioned to augment both the drafter's parallelism and output quality.
Falcon incorporates the Coupled Sequential Glancing Distillation technique, which fortifies inter-token dependencies within the same block, leading to increased speculation accuracy.
arXiv Detail & Related papers (2024-12-17T08:02:08Z) - EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees [25.703729145091483]
In this paper, we introduce a new technique, the context-aware dynamic draft tree, into draft modeling.
We conducted extensive evaluations on three series of Large Language Models (LLMs) and six tasks.
arXiv Detail & Related papers (2024-06-24T17:59:11Z) - Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models.
We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process.
Ouroboros can achieve speedups of up to $2.8\times$ over speculative decoding and $3.9\times$ over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z) - GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding [81.01996600734616]
We introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding.
GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM.
We will release our code, data, and the trained draft models.
arXiv Detail & Related papers (2024-02-03T08:44:11Z) - EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty [28.07947754770082]
Autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level.
The inherent uncertainty in feature (second-to-top-layer) level autoregression constrains its performance.
arXiv Detail & Related papers (2024-01-26T18:59:01Z) - ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation [50.01244854344167]
We bridge the performance gap between sparse and dense detectors by proposing Adaptive Sparse Anchor Generator (ASAG).
ASAG predicts dynamic anchors on patches rather than grids in a sparse way so that it alleviates the feature conflict problem.
Our method outperforms dense-d ones and achieves a better speed-accuracy trade-off.
arXiv Detail & Related papers (2023-08-18T02:06:49Z) - Improving Dual-Encoder Training through Dynamic Indexes for Negative Mining [61.09807522366773]
We introduce an algorithm that approximates the softmax with provable bounds and that dynamically maintains the tree.
In our study on datasets with over twenty million targets, our approach cuts error by half in relation to oracle brute-force negative mining.
arXiv Detail & Related papers (2023-03-27T15:18:32Z) - Highly Parallel Autoregressive Entity Linking with Discriminative Correction [51.947280241185]
We propose a very efficient approach that parallelizes autoregressive linking across all potential mentions.
Our model is >70 times faster and more accurate than the previous generative method.
arXiv Detail & Related papers (2021-09-08T17:28:26Z) - Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder [64.55176104620848]
We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder.
The latter is used for selecting better translation among various length candidates generated from the former, which dramatically improves the effectiveness of a large length beam with negligible overhead.
Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality.
arXiv Detail & Related papers (2020-10-25T06:35:30Z)