The Key to State Reduction in Linear Attention: A Rank-based Perspective
- URL: http://arxiv.org/abs/2602.04852v2
- Date: Thu, 12 Feb 2026 17:34:10 GMT
- Title: The Key to State Reduction in Linear Attention: A Rank-based Perspective
- Authors: Philipp Nazari, T. Konstantin Rusch
- Abstract summary: Recent empirical results indicate that the hidden state of trained linear attention models often exhibits a low-rank structure. We provide a theoretical analysis of the role of rank in linear attention, revealing that low effective rank can affect retrieval error by amplifying query noise. In addition to these theoretical insights, we conjecture that the low-rank states can be substantially reduced post-training.
- Score: 8.006873922525275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linear attention offers a computationally efficient yet expressive alternative to softmax attention. However, recent empirical results indicate that the hidden state of trained linear attention models often exhibits a low-rank structure, suggesting that these models underexploit their capacity in practice. To illuminate this phenomenon, we provide a theoretical analysis of the role of rank in linear attention, revealing that low effective rank can affect retrieval error by amplifying query noise. In addition to these theoretical insights, we conjecture that the low-rank states can be substantially reduced post-training with only minimal performance degradation, yielding faster and more memory-efficient models. To this end, we propose a novel hardware-aware approach that structurally prunes key and query matrices, reducing the state size while retaining compatibility with existing CUDA kernels. We adapt several existing pruning strategies to fit our framework and, building on our theoretical analysis, propose a novel structured pruning method based on a rank-revealing QR decomposition. Our empirical results, evaluated across models of varying sizes and on various downstream tasks, demonstrate the effectiveness of our state reduction framework. We highlight that our framework enables the removal of 50% of the query and key channels at only a marginal increase in perplexity. The code for this project can be found at https://github.com/camail-official/LinearAttentionPruning.
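The abstract's core mechanism can be illustrated with a minimal sketch: select the most linearly independent key channels via a rank-revealing (column-pivoted) QR criterion, then structurally prune the same channels from the query and key projections. All function and variable names below are illustrative assumptions, not the authors' API; see the linked repository for the actual implementation.

```python
# Hypothetical sketch of rank-revealing-QR-based channel selection for
# pruning query/key projections in linear attention. Names are assumptions.
import numpy as np

def select_channels(key_acts: np.ndarray, r: int) -> np.ndarray:
    """Greedy column-pivoted QR: pick the r key channels whose activation
    columns best span the observed key space (a rank-revealing criterion).

    key_acts: (n_tokens, d_k) matrix of stacked key activations.
    """
    A = key_acts.astype(float).copy()
    chosen = []
    for _ in range(r):
        norms = np.linalg.norm(A, axis=0)
        if chosen:
            norms[chosen] = -1.0          # never re-pick a channel
        j = int(np.argmax(norms))         # largest residual column = next pivot
        chosen.append(j)
        q = A[:, j] / (np.linalg.norm(A[:, j]) + 1e-12)
        A -= np.outer(q, q @ A)           # deflate the chosen direction
    return np.sort(np.array(chosen))

def prune_qk(W_q: np.ndarray, W_k: np.ndarray, idx: np.ndarray):
    """Keep only the selected output channels of the Q/K projections,
    shrinking the recurrent state from (d_k, d_v) to (r, d_v)."""
    return W_q[idx], W_k[idx]

# Toy usage: keys with low effective rank; 50% of the 64 channels removed,
# mirroring the paper's headline pruning ratio.
rng = np.random.default_rng(0)
key_acts = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 64))  # rank <= 8
idx = select_channels(key_acts, r=32)
W_q, W_k = rng.normal(size=(64, 128)), rng.normal(size=(64, 128))
W_q_small, W_k_small = prune_qk(W_q, W_k, idx)
```

Because the pruning removes whole channels (rows of the projection matrices) rather than individual weights, the reduced state keeps a dense layout, which is what makes it compatible with existing CUDA kernels.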
Related papers
- On the Limits of Layer Pruning for Generative Reasoning in LLMs [0.5437050212139086]
Layer pruning can compress large language models (LLMs) while retaining strong performance on classification benchmarks with little or no finetuning. We find that tasks requiring multi-step reasoning are particularly sensitive to depth reduction. Under realistic post-training constraints, we evaluate a simple mitigation strategy based on supervised finetuning.
arXiv Detail & Related papers (2026-02-02T11:57:22Z) - The Inlet Rank Collapse in Implicit Neural Representations: Diagnosis and Unified Remedy [30.776360295485762]
Implicit Neural Representations (INRs) have revolutionized continuous signal modeling, yet they struggle to recover fine-grained details within finite training budgets. We introduce a structural diagnostic framework to identify the "Inlet Rank Collapse", a phenomenon where the low-dimensional input coordinates fail to span the high-dimensional embedding space. We derive a Rank-Expanding Initialization, a minimalist remedy that ensures the representation rank scales with the layer width without architectural modifications or computational overhead.
arXiv Detail & Related papers (2026-02-02T01:38:19Z) - Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components. The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z) - C-SWAP: Explainability-Aware Structured Pruning for Efficient Neural Networks Compression [4.10373648742522]
Pruning is a widely used technique that promotes sparsity in model structures. We propose a novel one-shot pruning framework that relies on explainable deep learning. Our method consistently achieves substantial reductions in model size, with minimal impact on performance, and without the need for fine-tuning.
arXiv Detail & Related papers (2025-10-21T13:40:11Z) - Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs [15.23174472320989]
Large Language Models (LLMs) are central to many contemporary AI applications. Recent works in eXplainable AI (XAI) suggest that interpretability can also enable model compression.
arXiv Detail & Related papers (2025-06-16T17:38:36Z) - SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch tokens they extract.
We propose a novel fine-tuning smooth regularization that rectifies structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z) - Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z) - A Unified Framework for Soft Threshold Pruning [27.853698217792456]
We reformulate soft threshold pruning as an implicit optimization problem solved using the Iterative Shrinkage-Thresholding Algorithm (ISTA).
We derive an optimal threshold scheduler through an in-depth study of threshold scheduling based on our framework.
In principle, the derived pruning algorithm could sparsify any mathematical model trained via SGD.
arXiv Detail & Related papers (2023-02-25T08:16:14Z) - Towards Practical Control of Singular Values of Convolutional Layers [65.25070864775793]
Convolutional neural networks (CNNs) are easy to train, but their essential properties, such as generalization error and adversarial robustness, are hard to control.
Recent research demonstrated that singular values of convolutional layers significantly affect such elusive properties.
We offer a principled approach to alleviating constraints of the prior art at the expense of an insignificant reduction in layer expressivity.
arXiv Detail & Related papers (2022-11-24T19:09:44Z) - Towards Deeper Deep Reinforcement Learning [42.960199987696306]
In contrast to computer vision and natural language processing, state-of-the-art reinforcement learning algorithms often use only small networks.
We show that dataset size is not the limiting factor, and instead argue that instability from the actor in SAC taking gradients through the critic is the culprit.
arXiv Detail & Related papers (2021-06-02T13:41:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.