Related papers: Data-Free Pruning of Self-Attention Layers in LLMs

URL: http://arxiv.org/abs/2512.20636v1
Date: Wed, 03 Dec 2025 07:47:49 GMT
Title: Data-Free Pruning of Self-Attention Layers in LLMs
Authors: Dhananjay Saikumar, Blesson Varghese
Abstract summary: We propose Gate-Norm, a one-shot, weight-only criterion that ranks attention sublayers by query-key coupling. Gate-Norm removes the least coupled ones, requiring no calibration data, no forward passes, no fine-tuning, and no specialized kernels.
Score: 1.7188280334580195
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Many self-attention sublayers in large language models (LLMs) can be removed with little to no loss. We attribute this to the Attention Suppression Hypothesis: during pre-training, some deep attention layers learn to mute their own contribution, leaving the residual stream and the MLP to carry the representation. We propose Gate-Norm, a one-shot, weight-only criterion that ranks attention sublayers by query--key coupling and removes the least coupled ones, requiring no calibration data, no forward passes, no fine-tuning, and no specialized kernels. On 40-layer, 13B-parameter LLaMA models, Gate-Norm prunes the model in under a second. Pruning $8$--$16$ attention sublayers yields up to $1.30\times$ higher inference throughput while keeping average zero-shot accuracy within $2\%$ of the unpruned baseline across BoolQ, RTE, HellaSwag, WinoGrande, ARC-Easy/Challenge, and OpenBookQA. Across these settings, Gate-Norm matches data-driven pruning methods in accuracy while being $\sim 1000\times$ faster to score layers, enabling practical, data-free compression of LLMs.
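The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the Gate-Norm formula, so the score below is an assumption made only for illustration: the Frobenius norm of the product of the query and key projection weights. The function names `qk_coupling_score` and `rank_layers_for_pruning` and the toy weights are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]  # shape (d_model, d_head), stored row by row


def qk_coupling_score(w_q: Matrix, w_k: Matrix) -> float:
    """Frobenius norm of W_q @ W_k^T.

    ASSUMPTION: a stand-in proxy for "query-key coupling"; the paper's
    actual Gate-Norm criterion may be defined differently.
    """
    total = 0.0
    for q_row in w_q:
        for k_row in w_k:
            # Entry (i, j) of W_q @ W_k^T is the dot product of row i and row j.
            dot = sum(a * b for a, b in zip(q_row, k_row))
            total += dot * dot
    return math.sqrt(total)


def rank_layers_for_pruning(
    layers: Sequence[Tuple[Matrix, Matrix]], num_to_prune: int
) -> List[int]:
    """Return the indices of the `num_to_prune` lowest-scoring layers.

    Only the weights are read: no calibration data and no forward pass.
    """
    scores = [qk_coupling_score(w_q, w_k) for w_q, w_k in layers]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:num_to_prune])


# Toy example: three "layers" with 2x2 query/key weights (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
weak = [[0.1, 0.0], [0.0, 0.1]]
strong = [[2.0, 0.0], [0.0, 2.0]]
toy_layers = [(identity, identity), (weak, weak), (strong, identity)]

print(rank_layers_for_pruning(toy_layers, 1))  # [1]
```

In this toy setting the layer with the smallest query and key weights gets the lowest score and is selected for removal. In a real model the selected attention sublayers would be skipped so that the residual stream and the MLP carry the representation, as the abstract describes.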

Related papers

The Structural Scalpel: Automated Contiguous Layer Pruning for Large Language Models [33.90597962418094]
We propose CLP, a novel continuous layer pruning framework for large language models. CLP uses a differentiable concave gate algorithm that automatically identifies the best continuous layer segments for pruning. CLP can be seamlessly combined with quantization to further compress the model with only a slight performance loss.
arXiv Detail & Related papers (2025-10-25T16:40:17Z)
What Layers When: Learning to Skip Compute in LLMs with Residual Gates [66.23658560048241]
GateSkip is a residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch's output before it re-enters the residual stream.
arXiv Detail & Related papers (2025-10-13T16:31:50Z)
R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances the problem-solving ability of large language models. CoT incurs substantial inference cost due to long autoregressive trajectories. We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z)
SDMPrune: Self-Distillation MLP Pruning for Efficient Large Language Models [3.962074007736394]
We introduce a self-distillation loss during the pruning phase (rather than post-training) to fully exploit the predictions of the original model. We demonstrate that our method significantly outperforms existing pruning methods. Our method achieves very competitive performance among 1B-scale open source LLMs.
arXiv Detail & Related papers (2025-06-10T02:24:32Z)
Sparsity Forcing: Reinforcing Token Sparsity of MLLMs [40.93786579652003]
We explicitly reinforce token sparsity in well-posed multimodal large language models (MLLMs) through a simple RL-based post-training framework named Sparsity Forcing. Our method explores the efficiency-accuracy trade-off by running multiple rollouts with different token budgets, where both efficiency (token reduction ratio) and performance (answer correctness) are formulated as joint rewards.
arXiv Detail & Related papers (2025-04-23T01:45:55Z)
Towards Efficient Automatic Self-Pruning of Large Language Models [55.90119819642064]
Post-training structured pruning is a promising solution that prunes Large Language Models without the need for retraining. We argue that the key to mitigating this issue lies in accurately determining the pruning rate for each layer. We introduce Self-Pruner, an end-to-end automatic self-pruning framework for LLMs, which efficiently searches layer-wise pruning rates.
arXiv Detail & Related papers (2025-02-20T09:59:50Z)
HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator. We demonstrate that on a one billion parameter model, HiRE, applied to both the softmax and the feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models. It achieves, for the first time, high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach to fine-tune large language models (LLMs). Inspired by Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs. Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and opens new avenues to scale the great potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.