What Layers When: Learning to Skip Compute in LLMs with Residual Gates
- URL: http://arxiv.org/abs/2510.13876v2
- Date: Fri, 17 Oct 2025 07:30:17 GMT
- Title: What Layers When: Learning to Skip Compute in LLMs with Residual Gates
- Authors: Filipe Laitenberger, Dawid Kopiczko, Cees G. M. Snoek, Yuki M. Asano
- Abstract summary: GateSkip is a residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch's output before it re-enters the residual stream.
- Score: 66.23658560048241
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch's output before it re-enters the residual stream. During inference we rank tokens by the gate values and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining over 90% of baseline accuracy. For increasingly larger models, this tradeoff improves drastically. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
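The gating rule described in the abstract can be sketched numerically. The snippet below is an illustrative sketch, not the authors' code: the names `gated_residual_step`, `w_gate`, `b_gate`, and `keep_ratio`, and the gate's input shapes, are assumptions. It applies a scalar sigmoid-linear gate per token and, at inference, keeps the branch update only for the highest-gate tokens under a per-layer budget:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_residual_step(h, branch_out, w_gate, b_gate, keep_ratio=0.85):
    """One gated residual update with token-wise skipping (sketch).

    h:          (seq, dim) residual stream
    branch_out: (seq, dim) output of an Attention/MLP branch
    w_gate, b_gate: per-branch linear gate parameters (assumed shapes)
    keep_ratio: per-layer budget, i.e. fraction of tokens updated
    """
    # Scalar gate per token: sigmoid of a linear readout of the branch output.
    g = sigmoid(branch_out @ w_gate + b_gate)          # (seq,)
    # Rank tokens by gate value; keep only the top ones under the budget.
    k = max(1, int(np.ceil(keep_ratio * h.shape[0])))
    mask = np.zeros_like(g)
    mask[np.argsort(g)[-k:]] = 1.0
    # Skipped tokens pass through the residual stream unchanged.
    return h + (mask * g)[:, None] * branch_out
```

In the actual method the branch computation for skipped tokens would be avoided entirely; here the mask is applied after the fact purely for illustration.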
Related papers
- AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth [23.442686851761298]
AdaPonderLM is a self-supervised recurrent language model that learns token-wise early exiting during pretraining. AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy.
arXiv Detail & Related papers (2026-03-02T14:28:16Z) - Data-Free Pruning of Self-Attention Layers in LLMs [1.7188280334580195]
We propose Gate-Norm, a one-shot, weight-only criterion that ranks attention sublayers by query-key coupling. Gate-Norm removes the least coupled ones, requiring no calibration data, no forward passes, no fine-tuning, and no specialized kernels.
arXiv Detail & Related papers (2025-12-03T07:47:49Z) - Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z) - Dr.LLM: Dynamic Layer Routing in LLMs [55.11953638340419]
Dr.LLM is a retrofittable framework that equips pretrained models with lightweight per-layer routers that decide to skip, execute, or repeat a block. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average.
arXiv Detail & Related papers (2025-10-14T17:51:26Z) - R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) prompting enhances the problem-solving ability of large language models. However, CoT incurs substantial inference cost due to long autoregressive trajectories. We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs [10.702409298302547]
SeerAttention learns block-level attention sparsity from the Large Language Model itself. Inspired by the gating mechanism in Mixture of Experts (MoE), SeerAttention augments the conventional attention with a learnable gate. Our evaluation results demonstrate that SeerAttention achieves better model accuracy and lower latency for long-context pre-filling.
arXiv Detail & Related papers (2024-10-17T07:07:09Z) - Automatic Channel Pruning for Multi-Head Attention [0.11049608786515838]
We propose an automatic channel pruning method to take into account the multi-head attention mechanism.
On ImageNet-1K, applying our pruning method to FLattenTransformer yields superior accuracy at several MAC budgets.
arXiv Detail & Related papers (2024-05-31T14:47:20Z) - Automated Sizing and Training of Efficient Deep Autoencoders using Second Order Algorithms [0.46040036610482665]
We propose a multi-step training method for generalized linear classifiers.
Validation error is minimized by pruning unnecessary inputs.
Desired outputs are improved via a method similar to the Ho-Kashyap rule.
arXiv Detail & Related papers (2023-08-11T16:48:31Z) - Transkimmer: Transformer Learns to Layer-wise Skim [17.188613474427054]
One major computational inefficiency of Transformer-based models is that they spend an identical amount of computation throughout all layers.
We propose Transkimmer architecture, which learns to identify hidden state tokens that are not required by each layer.
The skimmed tokens are then forwarded directly to the final output, thus reducing the computation of the successive layers.
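The layer-wise skimming described above can be sketched in a few lines. The snippet below is a toy sketch, not Transkimmer's implementation: `transkimmer_like_forward` and the per-layer `skim_fns` decision functions are hypothetical names. It shows the core data flow in which skimmed tokens bypass all remaining layers and are carried straight to the output:

```python
import numpy as np

def transkimmer_like_forward(h, layers, skim_fns):
    """Sketch of layer-wise skimming (illustrative names, not the paper's API).

    h:        (seq, dim) hidden states
    layers:   per-layer transforms, applied only to active tokens
    skim_fns: per-layer decision functions returning a boolean keep-mask
    """
    out = h.copy()
    active = np.arange(h.shape[0])          # token indices still being computed
    for layer, skim in zip(layers, skim_fns):
        keep = skim(out[active])            # boolean mask over active tokens
        # Tokens not kept stay as-is in `out`: they skip every later layer.
        active = active[keep]
        if active.size == 0:
            break
        out[active] = layer(out[active])    # only active tokens pay compute
    return out
```

Each layer thus operates on a shrinking set of tokens, which is where the compute savings come from.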
arXiv Detail & Related papers (2022-05-15T16:23:30Z) - GDP: Stabilized Neural Network Pruning via Gates with Differentiable Polarization [84.57695474130273]
Gate-based or importance-based pruning methods aim to remove channels whose importance is smallest.
GDP can be plugged in before convolutional layers, without bells and whistles, to control the on-off state of each channel.
Experiments conducted over CIFAR-10 and ImageNet datasets show that the proposed GDP achieves the state-of-the-art performance.
arXiv Detail & Related papers (2021-09-06T03:17:10Z) - Learned Token Pruning for Transformers [39.181816379061374]
The Learned Token Pruning method reduces redundant tokens as the data passes through the different layers of a transformer.
We extensively test the performance of our approach on multiple GLUE tasks.
Preliminary results show up to 1.4x and 1.9x throughput improvements on a Tesla T4 GPU and an Intel Haswell CPU, respectively.
arXiv Detail & Related papers (2021-07-02T09:00:13Z) - Memory-efficient Transformers via Top-$k$ Attention [23.672065688109395]
In this work, we propose a simple yet highly accurate approximation for vanilla attention.
We process the queries in chunks, and for each query, compute the top-$k$ scores with respect to the keys.
We show our approach leads to accuracy that is nearly-identical to vanilla attention in multiple setups including training from scratch, fine-tuning, and zero-shot inference.
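The chunked top-$k$ procedure described above can be sketched directly. The snippet below is an illustrative sketch under assumed shapes and names (`topk_attention`, `chunk`), not the paper's implementation; on ties at the $k$-th score a few extra entries may survive, which a real implementation would handle explicitly:

```python
import numpy as np

def topk_attention(Q, K, V, k, chunk=16):
    """Chunked top-k approximation of softmax attention (sketch).

    Queries are processed in chunks; for each query, only the k largest
    query-key scores survive the softmax, the rest are masked to -inf.
    """
    d = Q.shape[1]
    outs = []
    for start in range(0, Q.shape[0], chunk):
        q = Q[start:start + chunk]                    # (c, d) query chunk
        scores = q @ K.T / np.sqrt(d)                 # (c, n) scaled scores
        # Per-query threshold: the k-th largest score in each row.
        thresh = np.sort(scores, axis=1)[:, -k][:, None]
        masked = np.where(scores >= thresh, scores, -np.inf)
        # Numerically stable softmax over the surviving scores.
        w = np.exp(masked - masked.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        outs.append(w @ V)
    return np.concatenate(outs, axis=0)
```

With `k` equal to the number of keys, this reduces exactly to vanilla softmax attention, which matches the near-identical-accuracy claim in the abstract for large enough `k`.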
arXiv Detail & Related papers (2021-06-13T02:30:23Z) - ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z) - Pruning Neural Belief Propagation Decoders [77.237958592189]
We introduce a method to tailor an overcomplete parity-check matrix to (neural) BP decoding using machine learning.
We achieve performance within 0.27 dB and 1.5 dB of maximum-likelihood (ML) performance while reducing the complexity of the decoder.
arXiv Detail & Related papers (2020-01-21T12:05:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.