STILL: Selecting Tokens for Intra-Layer Hybrid Attention to Linearize LLMs
- URL: http://arxiv.org/abs/2602.02180v1
- Date: Mon, 02 Feb 2026 14:49:18 GMT
- Title: STILL: Selecting Tokens for Intra-Layer Hybrid Attention to Linearize LLMs
- Authors: Weikang Meng, Liangyu Huo, Yadan Luo, Jiawen Guan, Jingyi Zhang, Yingjian Li, Zheng Zhang,
- Abstract summary: Linearizing pretrained large language models (LLMs) primarily relies on intra-layer hybrid attention mechanisms. We propose STILL, an intra-layer hybrid linearization framework for efficiently linearizing LLMs.
- Score: 23.745366354566315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linearizing pretrained large language models (LLMs) primarily relies on intra-layer hybrid attention mechanisms to alleviate the quadratic complexity of standard softmax attention. Existing methods perform token routing based on sliding-window partitions, resulting in position-based selection that fails to capture token-specific global importance. Meanwhile, linear attention further suffers from distribution shift caused by learnable feature maps that distort pretrained feature magnitudes. Motivated by these limitations, we propose STILL, an intra-layer hybrid linearization framework for efficiently linearizing LLMs. STILL introduces a Self-Saliency Score with strong local-global consistency, enabling accurate token selection using sliding-window computation, and retains salient tokens for sparse softmax attention while summarizing the remaining context via linear attention. To preserve pretrained representations, we design a Norm-Preserved Feature Map (NP-Map) that decouples feature direction from magnitude and reinjects pretrained norms. We further adopt a unified training-inference architecture with chunk-wise parallelization and delayed selection to improve hardware efficiency. Experiments show that STILL matches or surpasses the original pretrained model on commonsense and general reasoning tasks, and achieves up to an 86.2% relative improvement over prior linearized attention methods on long-context benchmarks.
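The NP-Map idea in the abstract, decoupling feature direction from magnitude and reinjecting the pretrained norm, can be sketched roughly as follows. This is a minimal NumPy interpretation based only on the abstract: the `elu_plus_one` feature map, the projection `W`, and all shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def elu_plus_one(h):
    # A common positive feature map used in linear attention (an assumption
    # here; the abstract does not specify the learnable map).
    return np.where(h > 0, h + 1.0, np.exp(h))

def norm_preserved_map(x, W, eps=1e-6):
    """Sketch of a Norm-Preserved Feature Map (NP-Map).

    1. Apply a feature map (learnable in practice; fixed here) to obtain
       a new feature direction.
    2. Normalize it to a unit vector, discarding the distorted magnitude.
    3. Reinject the pretrained norm ||x||, so feature magnitudes match
       the pretrained distribution.
    """
    f = elu_plus_one(x @ W)  # mapped features, positive by construction
    direction = f / (np.linalg.norm(f, axis=-1, keepdims=True) + eps)
    pretrained_norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return pretrained_norm * direction
```

By construction, the output of `norm_preserved_map` has (approximately) the same per-token norm as the input, so downstream linear attention sees the pretrained magnitude distribution regardless of how the feature map reshapes directions.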
Related papers
- LINA: Linear Autoregressive Image Generative Models with Continuous Tokens [56.80443965097921]
Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis. We study how to design compute-efficient linear attention within this framework. We present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions.
arXiv Detail & Related papers (2026-01-30T06:44:33Z)
- Dissecting Linear Recurrent Models: How Different Gating Strategies Drive Selectivity and Generalization [5.057995083193427]
Linear recurrent neural networks have emerged as efficient alternatives to the original Transformer's softmax attention mechanism. Existing benchmark tasks are either too simplistic to reveal substantial differences or excessively resource-intensive for experimentation. We introduce SelectivBench, a set of lightweight and customizable synthetic benchmark tasks for systematically evaluating sequence models.
arXiv Detail & Related papers (2026-01-18T21:49:21Z) - Distilling to Hybrid Attention Models via KL-Guided Layer Selection [66.06591032073744]
This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data.<n>We find that this approach is more effective than existing approaches for layer selection, including approaches that uniformly interleave linear attentions based on a fixed ratio.
arXiv Detail & Related papers (2025-12-23T18:12:22Z) - Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization [56.083511902353365]
Reinforcement learning (RL) typically applies uniform credit across an entire generation of Large language models.<n>This work positions attention as a privileged substrate that renders the internal logic of LLMs as a mechanistic blueprint of reasoning itself.<n>We introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes.
arXiv Detail & Related papers (2025-10-15T13:49:51Z) - Customizing the Inductive Biases of Softmax Attention using Structured Matrices [46.30740502186753]
Core component of attention is the scoring function, which transforms the inputs into low-dimensional queries and keys.<n>We propose new scoring functions based on computationally efficient structured matrices with high ranks, including Block-Train (BTT) and Multi-Level Low Rank (MLR)<n>Our MLR-based attention method achieves improved scaling laws compared to both standard attention and variants of sliding window attention.
arXiv Detail & Related papers (2025-09-09T17:50:58Z) - Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models [85.51753014478315]
We introduce AdaptPrune, a novel plug-and-play training-free pruning method.<n>It builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach.<n>Our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions.
arXiv Detail & Related papers (2025-03-11T03:58:17Z) - Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs [10.52833484759311]
We propose Tactic, a sparsity-adaptive and calibration-free sparse attention mechanism.<n>It dynamically selects tokens based on their cumulative attention scores rather than a fixed token budget.<n>We show that Tactic outperforms existing sparse attention algorithms, achieving superior accuracy and up to 7.29x decode attention speedup.
arXiv Detail & Related papers (2025-02-17T08:39:43Z) - CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up [64.38715211969516]
We introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token.<n>Experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity.
arXiv Detail & Related papers (2024-12-20T17:57:09Z) - SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs [10.702409298302547]
SeerAttention learns the block-level attention sparsity from the Large Language Models itself.<n>Inspired by the gating mechanism in Mixture of Experts (MoE), SeerAttention augments the conventional attention with a learnable gate.<n>Our evaluation results demonstrate that SeerAttention achieves better model accuracy and lower latency for long-context pre-filling.
arXiv Detail & Related papers (2024-10-17T07:07:09Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.