Related papers: Dynamic Depth Decoding: Faster Speculative Decoding for LLMs

URL: http://arxiv.org/abs/2409.00142v1
Date: Fri, 30 Aug 2024 03:27:48 GMT
Title: Dynamic Depth Decoding: Faster Speculative Decoding for LLMs
Authors: Oscar Brown, Zhengjie Wang, Andrea Do, Nikhil Mathew, Cheng Yu
Abstract summary: We introduce Dynamic Depth Decoding (DDD), which optimises EAGLE-2's tree drafting method using a dynamic depth. This extends the average speedup that EAGLE-2 achieves over EAGLE by $44\%$, giving DDD an average speedup of $3.16$x.
Score: 8.071750249796459
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The acceleration of Large Language Models (LLMs) with speculative decoding provides a significant runtime improvement without any loss of accuracy. Currently, EAGLE-2 is the state-of-the-art speculative decoding method, improving on EAGLE with a dynamic draft tree. We introduce Dynamic Depth Decoding (DDD), which optimises EAGLE-2's tree drafting method using a dynamic depth. This extends the average speedup that EAGLE-2 achieves over EAGLE by $44\%$, giving DDD an average speedup of $3.16$x.
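To make the "no loss of accuracy" claim concrete, here is a minimal sketch of greedy speculative decoding with a draft chain of fixed depth. It is an illustration only, not the DDD or EAGLE-2 algorithm: both "models" (`target_next`, `draft_next`) are hypothetical toy functions, the draft is a single chain rather than a tree, and `depth` is a fixed parameter, whereas the abstract says DDD chooses the depth dynamically (how it does so is not modeled here).

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical 'large' model: a deterministic toy rule (assumption)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical 'small' model: matches the target except when len(ctx) % 4 == 0."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain autoregressive decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, depth: int
) -> Tuple[List[Token], int]:
    """Draft `depth` tokens, then keep the longest prefix the target agrees with."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. Draft a chain of `depth` tokens with the cheap model.
        proposal: List[Token] = []
        for _ in range(depth):
            proposal.append(draft(out + proposal))

        # 2. Verify. A real system scores all draft positions in one batched
        #    target forward pass; this toy calls the target per position and
        #    counts the whole verification step as a single pass.
        target_passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: use the target's token
                break
            accepted.append(tok)
        else:
            # Every draft token was accepted, so the target adds one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec_out, passes = speculative_decode(target_next, draft_next, prompt, n_new=20, depth=4)
    print("identical to greedy:", spec_out == greedy_decode(target_next, prompt, 20))
    print("target passes:", passes, "vs. 20 for plain greedy decoding")
```

Every token that is kept is one the target itself would have produced, so the output matches plain greedy decoding. The saving comes from needing fewer target passes than generated tokens, which depends on how often the draft is accepted and on the chosen depth.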

Related papers

DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding [7.204881999658682]
We introduce DEL, a plug-and-play method that adaptively selects the exit layer and speculation length during inference. DEL achieves overall speedups of $2.16\times \sim 2.50\times$ over vanilla auto-regressive decoding.
arXiv Detail & Related papers (2025-04-08T01:12:59Z)
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test [25.703729145091483]
A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. EAGLE-3 abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data.
arXiv Detail & Related papers (2025-03-03T18:59:04Z)
Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree [7.438117410146904]
Falcon is an innovative speculative decoding framework fashioned to augment both the drafter's parallelism and output quality. Falcon incorporates the Coupled Sequential Glancing Distillation technique, which fortifies inter-token dependencies within the same block, leading to increased speculation accuracy.
arXiv Detail & Related papers (2024-12-17T08:02:08Z)
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees [25.703729145091483]
In this paper, we propose a context-aware dynamic draft tree technique for draft modeling. We conducted extensive evaluations on three series of Large Language Models (LLMs) and six tasks.
arXiv Detail & Related papers (2024-06-24T17:59:11Z)
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models. We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process. Ouroboros can achieve speedups of up to $2.8\times$ over speculative decoding and $3.9\times$ over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z)
GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding [81.01996600734616]
We introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding. GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM. We will release our code, data, and the trained draft models.
arXiv Detail & Related papers (2024-02-03T08:44:11Z)
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty [28.07947754770082]
Autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level. The inherent uncertainty in feature (second-to-top-layer) level autoregression constrains its performance.
arXiv Detail & Related papers (2024-01-26T18:59:01Z)
ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation [50.01244854344167]
We bridge the performance gap between sparse and dense detectors by proposing Adaptive Sparse Anchor Generator (ASAG). ASAG predicts dynamic anchors on patches rather than grids in a sparse way so that it alleviates the feature conflict problem. Our method outperforms dense-initialized ones and achieves a better speed-accuracy trade-off.
arXiv Detail & Related papers (2023-08-18T02:06:49Z)
Improving Dual-Encoder Training through Dynamic Indexes for Negative Mining [61.09807522366773]
We introduce an algorithm that approximates the softmax with provable bounds and that dynamically maintains the tree. In our study on datasets with over twenty million targets, our approach cuts error by half in relation to oracle brute-force negative mining.
arXiv Detail & Related papers (2023-03-27T15:18:32Z)
Highly Parallel Autoregressive Entity Linking with Discriminative Correction [51.947280241185]
We propose a very efficient approach that parallelizes autoregressive linking across all potential mentions. Our model is >70 times faster and more accurate than the previous generative method.
arXiv Detail & Related papers (2021-09-08T17:28:26Z)
Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder [64.55176104620848]
We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder. The latter is used to select the better translation among the candidates of various lengths generated by the former, which dramatically improves the effectiveness of a large length beam with negligible overhead. Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality.
arXiv Detail & Related papers (2020-10-25T06:35:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.