EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
- URL: http://arxiv.org/abs/2503.01840v1
- Date: Mon, 03 Mar 2025 18:59:04 GMT
- Title: EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
- Authors: Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
- Abstract summary: A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. We observe that scaling up data provides only limited improvements for EAGLE. We introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion.
- Score: 25.703729145091483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. The code is available at https://github.com/SafeAILab/EAGLE.
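The abstract's two changes can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the released EAGLE-3 code from the linked repository: hidden states taken from several depths of the target model are fused and passed to a small draft module whose head emits token logits directly, rather than regressing the next top-layer feature. The class name, the concat-plus-linear fusion, and all sizes are assumptions for illustration.
```python
# Minimal sketch of multi-layer feature fusion + direct token prediction.
# Names, sizes, and the fusion scheme are illustrative assumptions only.
import torch
import torch.nn as nn

class MultiLayerFusionDraftHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_fused_layers: int = 3):
        super().__init__()
        # Project concatenated low/mid/high-layer features back to hidden_size.
        self.fuse = nn.Linear(num_fused_layers * hidden_size, hidden_size)
        # A single encoder block stands in for the lightweight draft model.
        self.block = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=8, batch_first=True)
        # Direct token prediction: the head outputs vocabulary logits.
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, layer_features: list) -> torch.Tensor:
        # layer_features: list of [batch, seq, hidden] tensors from different
        # depths of the target model (an assumption about the interface).
        fused = self.fuse(torch.cat(layer_features, dim=-1))
        h = self.block(fused)        # self-attention over the fused sequence
        return self.lm_head(h)       # [batch, seq, vocab] draft logits

# Toy usage with random features from three hypothetical layers.
feats = [torch.randn(1, 5, 512) for _ in range(3)]
logits = MultiLayerFusionDraftHead(hidden_size=512, vocab_size=32000)(feats)
print(logits.shape)  # torch.Size([1, 5, 32000])
```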
Related papers
- S*: Test Time Scaling for Code Generation [55.11863577956177]
We propose S*, the first hybrid test-time scaling framework for code generation.
S* substantially improves the coverage and selection accuracy of generated code.
arXiv Detail & Related papers (2025-02-20T09:18:53Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.
Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling.
We present T1, which scales reinforcement learning by encouraging exploration and helps us understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z) - Multi-Objective Large Language Model Unlearning [3.372396620898397]
Gradient Ascent (GA) is a proactive way to decrease the prediction probability of the model on the target data.
We propose the Multi-Objective Large Language Model Unlearning (MOLLM) algorithm to overcome gradient explosion and catastrophic forgetting.
Our empirical results verify that MOLLM outperforms SOTA GA-based LLM unlearning methods in both unlearning effect and model utility preservation.
arXiv Detail & Related papers (2024-12-29T09:35:56Z) - What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees [25.703729145091483]
In this paper, we propose a new technique that introduces a context-aware dynamic draft tree into draft modeling.
We conducted extensive evaluations on three series of Large Language Models (LLMs) and six tasks.
arXiv Detail & Related papers (2024-06-24T17:59:11Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that, when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boost performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty [25.703729145091483]
Autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level. However, the inherent uncertainty of feature-level autoregression constrains its performance (a minimal sketch of the speculative-sampling acceptance step these methods build on appears after this list).
arXiv Detail & Related papers (2024-01-26T18:59:01Z) - Learning to Learn to Predict Performance Regressions in Production at Meta [11.45540873578889]
This article gives an account of the experiences we gained when researching and deploying an ML-based regression prediction pipeline at Meta.
Our investigation shows the inherent difficulty of the performance prediction problem, which is characterized by a large imbalance between benign and regressing changes.
Our results also call into question the general applicability of Transformer-based architectures for performance prediction.
arXiv Detail & Related papers (2022-08-08T18:16:51Z) - DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
We find that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z)
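Several of the entries above (EAGLE, EAGLE-2, EAGLE-3) rest on speculative sampling, whose acceptance step guarantees that the accelerated output still follows the target model's distribution. Below is a minimal sketch of that standard acceptance rule for a single drafted token; the function name and single-token interface are illustrative assumptions, not code from any of the listed papers.
```python
# Standard speculative-sampling acceptance rule (single-token sketch).
import torch

def accept_or_resample(p: torch.Tensor, q: torch.Tensor, drafted: int):
    """p: target-model probabilities, q: draft-model probabilities (both [vocab]).
    drafted: token id proposed by the draft model. Returns (token_id, accepted)."""
    # Accept the drafted token with probability min(1, p[x] / q[x]).
    if torch.rand(()) < torch.clamp(p[drafted] / q[drafted], max=1.0):
        return drafted, True
    # Otherwise resample from the residual distribution max(0, p - q), renormalized,
    # which keeps the overall output distribution exactly equal to p.
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, 1)), False
```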