State Rank Dynamics in Linear Attention LLMs
- URL: http://arxiv.org/abs/2602.02195v1
- Date: Mon, 02 Feb 2026 15:00:42 GMT
- Title: State Rank Dynamics in Linear Attention LLMs
- Authors: Ao Sun, Hongtao Zhang, Heng Zhou, Yixuan Ma, Yiran Qin, Tongrui Su, Yan Liu, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He,
- Abstract summary: State Rank Stratification is characterized by a distinct spectral bifurcation among linear attention heads. Low-rank heads are indispensable for model reasoning, whereas high-rank heads exhibit significant redundancy. We propose Joint Rank-Norm Pruning, a zero-shot strategy that achieves a 38.9% reduction in KV-cache overhead while largely maintaining model accuracy.
- Score: 37.607046806053035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linear Attention Large Language Models (LLMs) offer a compelling recurrent formulation that compresses context into a fixed-size state matrix, enabling constant-time inference. However, the internal dynamics of this compressed state remain largely opaque. In this work, we present a comprehensive study on the runtime state dynamics of state-of-the-art Linear Attention models. We uncover a fundamental phenomenon termed State Rank Stratification, characterized by a distinct spectral bifurcation among linear attention heads: while one group maintains an effective rank oscillating near zero, the other exhibits rapid growth that converges to an upper bound. Extensive experiments across diverse inference contexts reveal that these dynamics remain strikingly consistent, indicating that the identity of a head, whether low-rank or high-rank, is an intrinsic structural property acquired during pre-training, rather than a transient state dependent on the input data. Furthermore, our diagnostic probes reveal a surprising functional divergence: low-rank heads are indispensable for model reasoning, whereas high-rank heads exhibit significant redundancy. Leveraging this insight, we propose Joint Rank-Norm Pruning, a zero-shot strategy that achieves a 38.9% reduction in KV-cache overhead while largely maintaining model accuracy.
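To make the state-rank phenomenon concrete, here is a minimal, self-contained sketch of how one might track the effective rank of a linear attention head's recurrent state. It is an illustration under explicit assumptions: the spectral-entropy definition of effective rank, the ungated update S_t = S_{t-1} + k_t v_t^T, and the synthetic key distributions are choices made here, not details taken from the paper.

```python
# Minimal sketch (not the paper's code): simulate the recurrent state of a
# linear attention head, S_t = S_{t-1} + k_t v_t^T (gating/decay omitted),
# and track its effective rank as tokens are consumed. The spectral-entropy
# definition of effective rank and the synthetic key distributions below are
# assumptions made for illustration only.
import numpy as np


def effective_rank(S: np.ndarray, eps: float = 1e-12) -> float:
    """exp(entropy) of the normalized singular-value spectrum of S."""
    s = np.linalg.svd(S, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))


def run_head(keys: np.ndarray, values: np.ndarray) -> list[float]:
    """Accumulate S_t = sum_{i<=t} k_i v_i^T and record the effective rank per step."""
    S = np.zeros((keys.shape[1], values.shape[1]))
    ranks = []
    for k, v in zip(keys, values):
        S += np.outer(k, v)
        ranks.append(effective_rank(S))
    return ranks


rng = np.random.default_rng(0)
T, d_k, d_v = 256, 64, 64
values = rng.standard_normal((T, d_v))

# Toy "low-rank" head: keys confined to a one-dimensional subspace of R^{d_k},
# so the state's effective rank stays pinned near its minimum.
low_keys = rng.standard_normal((T, 1)) @ rng.standard_normal((1, d_k))

# Toy "high-rank" head: keys span the full key space, so the effective rank
# grows rapidly and saturates near an upper bound (at most min(d_k, d_v)).
high_keys = rng.standard_normal((T, d_k))

low, high = run_head(low_keys, values), run_head(high_keys, values)
print(f"low-rank head : effective rank after {T} tokens ~ {low[-1]:.1f}")
print(f"high-rank head: effective rank after {T} tokens ~ {high[-1]:.1f}")
```

In this toy setup the two key distributions reproduce the bifurcation described above: the restricted head stays low-rank while the unrestricted one climbs toward its upper bound. The paper's Joint Rank-Norm Pruning would then combine such a per-head rank statistic with a norm statistic to decide which redundant high-rank heads to drop; its exact scoring rule is not given in the abstract.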
Related papers
- Attention in Constant Time: Vashista Sparse Attention for Long-Context Decoding with Exponential Guarantees [0.0]
Large language models spend most of their inference cost on attention over long contexts. We formalize this phenomenon by modeling attention as a projection onto the convex hull of key vectors. We introduce Vashista Sparse Attention, a drop-in mechanism that maintains a small candidate set per query.
arXiv Detail & Related papers (2026-02-14T14:29:10Z) - STORE: Semantic Tokenization, Orthogonal Rotation and Efficient Attention for Scaling Up Ranking Models [11.965535230928372]
STORE is a unified and scalable token-based ranking framework built upon three core innovations. Our framework consistently improves prediction accuracy (online CTR by 2.71%, AUC by 1.195%) and training efficiency (1.84x throughput).
arXiv Detail & Related papers (2025-11-24T06:20:02Z) - RainDiff: End-to-end Precipitation Nowcasting Via Token-wise Attention Diffusion [64.49056527678606]
We propose a Token-wise Attention integrated into not only the U-Net diffusion model but also the radar-temporal encoder. Unlike prior approaches, our method integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion. Our experiments and evaluations demonstrate that the proposed method significantly outperforms state-of-the-art approaches in robustness, local fidelity, and generalization, and is superior in complex precipitation forecasting scenarios.
arXiv Detail & Related papers (2025-10-16T17:59:13Z) - Transformers Learn Faster with Semantic Focus [57.97235825738412]
We study sparse transformers in terms of learnability and generalization. We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models.
arXiv Detail & Related papers (2025-06-17T01:19:28Z) - Neural Collapse in Cumulative Link Models for Ordinal Regression: An Analysis with Unconstrained Feature Model [5.339955242953934]
We show that a phenomenon we call Ordinal Neural Collapse (ONC) indeed emerges and is characterized by three key properties. In particular, in the zero-regularization limit, a highly local and simple geometric relationship emerges between the latent variables and the threshold values.
arXiv Detail & Related papers (2025-06-06T06:57:02Z) - In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z) - Understanding Representation Dynamics of Diffusion Models via Low-Dimensional Modeling [29.612011138019255]
We study the emergence of unimodal representation dynamics in diffusion models. The unimodality arises from an interplay between denoising strength and class confidence across noise scales. In classification tasks, the presence of unimodal dynamics reliably reflects the generalization of the diffusion model.
arXiv Detail & Related papers (2025-02-09T01:58:28Z) - Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [77.66717051042032]
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
arXiv Detail & Related papers (2024-10-17T17:54:06Z) - A phase transition between positional and semantic learning in a solvable model of dot-product attention [30.96921029675713]
A solvable model of dot-product attention is studied as a non-linear self-attention layer with trainable, low-dimensional query and key matrices.
We show that the model learns either a positional attention mechanism (with tokens attending to each other based on their respective positions) or a semantic attention mechanism (with tokens attending to each other based on their meaning), and can exhibit a transition from the former to the latter with increasing sample complexity.
arXiv Detail & Related papers (2024-02-06T11:13:54Z) - Rank Collapse Causes Over-Smoothing and Over-Correlation in Graph Neural Networks [3.566568169425391]
We show that with increased depth, node representations become dominated by a low-dimensional subspace that depends on the aggregation function but not on the feature transformations.
For all aggregation functions, the rank of the node representations collapses, resulting in over-smoothing for particular aggregation functions.
arXiv Detail & Related papers (2023-08-31T15:22:31Z) - Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z)