ADEPT: Adaptive Dynamic Early-Exit Process for Transformers
- URL: http://arxiv.org/abs/2601.03700v1
- Date: Wed, 07 Jan 2026 08:34:41 GMT
- Title: ADEPT: Adaptive Dynamic Early-Exit Process for Transformers
- Authors: Sangmin Yoo, Srikanth Malla, Chiho Choi, Wei D. Lu, Joon Hee Choi
- Abstract summary: Early-exit strategies have proven effective in reducing computational demands by halting inference earlier. We introduce ADEPT, a novel approach designed to overcome this issue and enable dynamic early exit in both the prefill and generation phases. We show that ADEPT improves efficiency by up to 25% in language generation tasks and achieves a 4x speed-up in downstream classification tasks, with up to a 45% improvement in performance.
- Score: 12.23755727319088
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The inference of large language models imposes significant computational workloads, often requiring the processing of billions of parameters. Although early-exit strategies have proven effective in reducing computational demands by halting inference earlier, they apply either to only the first token in the generation phase or at the prompt level in the prefill phase. Thus, the Key-Value (KV) cache for skipped layers remains a bottleneck for subsequent token generation, limiting the benefits of early exit. We introduce ADEPT (Adaptive Dynamic Early-exit Process for Transformers), a novel approach designed to overcome this issue and enable dynamic early exit in both the prefill and generation phases. The proposed adaptive token-level early-exit mechanism adjusts computation dynamically based on token complexity, optimizing efficiency without compromising performance. ADEPT further enhances the KV generation procedure by decoupling sequential dependencies in skipped layers, making token-level early exit more practical. Experimental results demonstrate that ADEPT improves efficiency by up to 25% in language generation tasks and achieves a 4x speed-up in downstream classification tasks, with up to a 45% improvement in performance.
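To make the idea of token-level early exit concrete, here is a minimal sketch of a generic confidence-threshold exit rule. It is not ADEPT's actual mechanism, which the abstract does not specify: the function early_exit_forward, the shared exit head, the threshold value, and the toy layers are all assumptions made for illustration, and the sketch does not address the KV cache of skipped layers.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_forward(
    hidden: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    exit_head: Callable[[Vector], Vector],
    threshold: float,
) -> Tuple[Vector, int]:
    """Apply layers one at a time for a single token and stop as soon as
    the exit head's top-1 probability reaches the threshold.

    Returns the predicted distribution and the number of layers used.
    This is a hypothetical confidence rule, not the paper's criterion.
    """
    probs = softmax(exit_head(hidden))
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(exit_head(hidden))
        if max(probs) >= threshold:
            return probs, depth  # confident enough: skip remaining layers
    return probs, len(layers)  # never confident: full depth was used


# Toy stand-ins (assumptions): each "layer" doubles the hidden state and
# the exit head reads the hidden state directly as two-class logits.
toy_layers = [lambda h: [2.0 * x for x in h] for _ in range(6)]
toy_head = lambda h: list(h)

_, easy_depth = early_exit_forward([2.0, 0.0], toy_layers, toy_head, 0.9)
_, hard_depth = early_exit_forward([0.1, 0.0], toy_layers, toy_head, 0.9)
print(easy_depth, hard_depth)
```

In this toy setup the token with the larger initial margin leaves after fewer layers than the ambiguous one, which is the per-token compute adaptation the abstract describes at a high level.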
Related papers
- OUSAC: Optimized Guidance Scheduling with Adaptive Caching for DiT Acceleration [4.771742494878726]
OUSAC is a framework that accelerates diffusion transformers (DiT) through systematic optimization.
Our key insight is that variable guidance scales enable sparse computation.
Stage-1 employs evolutionary algorithms to jointly optimize which timesteps to skip and what guidance scale to use.
Stage-2 introduces adaptive rank allocation that tailors calibration efforts per transformer block.
arXiv Detail & Related papers (2025-12-16T05:11:54Z)
- BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination [14.53308613746613]
BitStopper is a fine-grained algorithm-architecture co-design that operates without a sparsity predictor.
It achieves 2.03x and 1.89x speedups over Sanger and SOFA, respectively, while delivering 2.4x and 2.1x improvements in energy efficiency.
arXiv Detail & Related papers (2025-12-06T14:44:38Z)
- Decoupled Multi-Predictor Optimization for Inference-Efficient Model Tuning [59.27124079347153]
Early exiting in conjunction with multi-stage predictors offers a straightforward way to achieve an inference-efficient model.
How can early stages provide low-level fundamental features to deep stages while simultaneously supplying high-level discriminative features to early-stage predictors?
We propose a Decoupled Multi-Predictor Optimization (DMPO) method to effectively decouple the low-level representative ability and high-level discriminative ability in early stages.
arXiv Detail & Related papers (2025-11-05T07:16:49Z)
- IIET: Efficient Numerical Transformer via Implicit Iterative Euler Method [59.02943805284446]
Iterative Implicit Euler Transformer (IIET)
IIAD allows users to effectively balance the performance-efficiency trade-off.
The E-IIET variant achieves an average performance gain exceeding 1.6% over vanilla Transformer with comparable speed.
arXiv Detail & Related papers (2025-09-26T15:14:03Z)
- Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.
This paper first attributes the inefficiency of Transformers to the attention sink phenomenon.
We replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
- BEExformer: A Fast Inferencing Binarized Transformer with Early Exits [2.7651063843287718]
We introduce Binarized Early Exit Transformer (BEExformer), the first-ever selective learning-based transformer integrating Binarization-Aware Training (BAT) with Early Exit (EE).
BAT employs a differentiable second-order approximation to the sign function, enabling gradient that captures both the sign and magnitude of the weights.
EE mechanism hinges on fractional reduction in entropy among intermediate transformer blocks with soft-routing loss estimation.
This accelerates inference by reducing FLOPs by 52.08% and even improves accuracy by 2.89% by resolving the "overthinking" problem inherent in deep networks.
arXiv Detail & Related papers (2024-12-06T17:58:14Z)
- FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction [16.84400858871298]
We propose FiRST, an algorithm that reduces latency by using layer-specific routers to select a subset of transformer layers adaptively for each input sequence.
FiRST preserves compatibility with KV caching, enabling faster inference while being quality-aware.
Our approach reveals that input adaptivity is critical: different task-specific middle layers play a crucial role in evolving hidden representations depending on tasks.
arXiv Detail & Related papers (2024-10-16T12:45:35Z)
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference [18.308180927492643]
ToP is a ranking-distilled token distillation technique, which distills effective token rankings from the final layer of unpruned models to early layers of pruned models.
ToP reduces the average FLOPs of BERT by 8.1x while achieving competitive accuracy on GLUE, and provides a real latency speedup of up to 7.4x on an Intel CPU.
arXiv Detail & Related papers (2023-06-26T03:06:57Z)
- You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model [37.24203191658052]
Large-scale Transformer models bring significant improvements for various downstream vision language tasks with a unified architecture.
Performance improvements come with increasing model size, resulting in slow inference speed and increased cost for serving.
We propose a novel early exiting strategy for unified visual language models, which dynamically skips layers in the encoder and decoder simultaneously.
arXiv Detail & Related papers (2022-11-21T02:32:25Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- Easy and Efficient Transformer : Scalable Inference Solution For large NLP model [14.321889138798072]
This paper introduces a series of ultra-large-scale pre-training model optimization methods.
An inference engine -- Easy and Efficient Transformer (EET) is proposed.
EET achieves a 1.5-15x state-of-the-art speedup varying with context length.
arXiv Detail & Related papers (2021-04-26T11:00:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.