STILL: Selecting Tokens for Intra-Layer Hybrid Attention to Linearize LLMs
- URL: http://arxiv.org/abs/2602.02180v1
- Date: Mon, 02 Feb 2026 14:49:18 GMT
- Title: STILL: Selecting Tokens for Intra-Layer Hybrid Attention to Linearize LLMs
- Authors: Weikang Meng, Liangyu Huo, Yadan Luo, Jiawen Guan, Jingyi Zhang, Yingjian Li, Zheng Zhang,
- Abstract summary: Linearizing pretrained large language models (LLMs) primarily relies on intra-layer hybrid attention mechanisms. We propose STILL, an intra-layer hybrid linearization framework for efficiently linearizing LLMs.
- Score: 23.745366354566315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linearizing pretrained large language models (LLMs) primarily relies on intra-layer hybrid attention mechanisms to alleviate the quadratic complexity of standard softmax attention. Existing methods perform token routing based on sliding-window partitions, resulting in position-based selection that fails to capture token-specific global importance. Meanwhile, linear attention further suffers from distribution shift caused by learnable feature maps that distort pretrained feature magnitudes. Motivated by these limitations, we propose STILL, an intra-layer hybrid linearization framework for efficiently linearizing LLMs. STILL introduces a Self-Saliency Score with strong local-global consistency, enabling accurate token selection using sliding-window computation, and retains salient tokens for sparse softmax attention while summarizing the remaining context via linear attention. To preserve pretrained representations, we design a Norm-Preserved Feature Map (NP-Map) that decouples feature direction from magnitude and reinjects pretrained norms. We further adopt a unified training-inference architecture with chunk-wise parallelization and delayed selection to improve hardware efficiency. Experiments show that STILL matches or surpasses the original pretrained model on commonsense and general reasoning tasks, and achieves up to an 86.2% relative improvement over prior linearized attention methods on long-context benchmarks.
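The NP-Map idea in the abstract, decoupling feature direction from magnitude and reinjecting the pretrained norm, can be sketched roughly as follows. This is a minimal NumPy interpretation based only on the abstract: the `elu_plus_one` feature map, the projection `W`, and all shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def elu_plus_one(h):
    # A common positive feature map used in linear attention (an assumption
    # here; the abstract does not specify the learnable map).
    return np.where(h > 0, h + 1.0, np.exp(h))

def norm_preserved_map(x, W, eps=1e-6):
    """Sketch of a Norm-Preserved Feature Map (NP-Map).

    1. Apply a feature map (learnable in practice; fixed here) to obtain
       a new feature direction.
    2. Normalize it to a unit vector, discarding the distorted magnitude.
    3. Reinject the pretrained norm ||x||, so feature magnitudes match
       the pretrained distribution.
    """
    f = elu_plus_one(x @ W)  # mapped features, positive by construction
    direction = f / (np.linalg.norm(f, axis=-1, keepdims=True) + eps)
    pretrained_norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return pretrained_norm * direction
```

By construction, the output of `norm_preserved_map` has (approximately) the same per-token norm as the input, so downstream linear attention sees the pretrained magnitude distribution regardless of how the feature map reshapes directions.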
Related papers
- LINA: Linear Autoregressive Image Generative Models with Continuous Tokens [56.80443965097921]
Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis. We study how to design compute-efficient linear attention within this framework. We present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions.
arXiv Detail & Related papers (2026-01-30T06:44:33Z)
- Dissecting Linear Recurrent Models: How Different Gating Strategies Drive Selectivity and Generalization [5.057995083193427]
Linear recurrent neural networks have emerged as efficient alternatives to the original Transformer's softmax attention mechanism. Existing benchmark tasks are either too simplistic to reveal substantial differences or excessively resource-intensive for experimentation. We introduce SelectivBench, a set of lightweight and customizable synthetic benchmark tasks for systematically evaluating sequence models.
arXiv Detail & Related papers (2026-01-18T21:49:21Z) - Distilling to Hybrid Attention Models via KL-Guided Layer Selection [66.06591032073744]
This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data.<n>We find that this approach is more effective than existing approaches for layer selection, including approaches that uniformly interleave linear attentions based on a fixed ratio.
arXiv Detail & Related papers (2025-12-23T18:12:22Z) - Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization [56.083511902353365]
Reinforcement learning (RL) typically applies uniform credit across an entire generation of Large language models.<n>This work positions attention as a privileged substrate that renders the internal logic of LLMs as a mechanistic blueprint of reasoning itself.<n>We introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes.
arXiv Detail & Related papers (2025-10-15T13:49:51Z) - Customizing the Inductive Biases of Softmax Attention using Structured Matrices [46.30740502186753]
Core component of attention is the scoring function, which transforms the inputs into low-dimensional queries and keys.<n>We propose new scoring functions based on computationally efficient structured matrices with high ranks, including Block-Train (BTT) and Multi-Level Low Rank (MLR)<n>Our MLR-based attention method achieves improved scaling laws compared to both standard attention and variants of sliding window attention.
arXiv Detail & Related papers (2025-09-09T17:50:58Z) - Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models [85.51753014478315]
We introduce AdaptPrune, a novel plug-and-play training-free pruning method.<n>It builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach.<n>Our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions.
arXiv Detail & Related papers (2025-03-11T03:58:17Z) - Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs [10.52833484759311]
We propose Tactic, a sparsity-adaptive and calibration-free sparse attention mechanism.<n>It dynamically selects tokens based on their cumulative attention scores rather than a fixed token budget.<n>We show that Tactic outperforms existing sparse attention algorithms, achieving superior accuracy and up to 7.29x decode attention speedup.
arXiv Detail & Related papers (2025-02-17T08:39:43Z) - CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up [64.38715211969516]
We introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token.<n>Experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity.
arXiv Detail & Related papers (2024-12-20T17:57:09Z) - SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs [10.702409298302547]
SeerAttention learns the block-level attention sparsity from the Large Language Models itself.<n>Inspired by the gating mechanism in Mixture of Experts (MoE), SeerAttention augments the conventional attention with a learnable gate.<n>Our evaluation results demonstrate that SeerAttention achieves better model accuracy and lower latency for long-context pre-filling.
arXiv Detail & Related papers (2024-10-17T07:07:09Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.