Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models
- URL: http://arxiv.org/abs/2602.03681v1
- Date: Tue, 03 Feb 2026 16:02:50 GMT
- Title: Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models
- Authors: Difan Deng, Andreas Bentzen Winje, Lukas Fehring, Marius Lindauer
- Abstract summary: We propose a framework that applies both linear attention and softmax attention operations within the same layer on different tokens. NAtS-L automatically determines whether a token can be handled by a linear attention model, i.e., tokens that have only short-term impact. By searching for optimal Gated DeltaNet and softmax attention combinations across tokens, we show that NAtS-L provides a strong yet efficient token-level hybrid architecture.
- Score: 7.961563754693873
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quadratic computational complexity of softmax transformers has become a bottleneck in long-context scenarios. In contrast, linear attention model families provide a promising direction towards a more efficient sequential model. These linear attention models compress past KV values into a single hidden state, thereby efficiently reducing complexity during both training and inference. However, their expressivity remains limited by the size of their hidden state. Previous work proposed interleaving softmax and linear attention layers to reduce computational complexity while preserving expressivity. Nevertheless, the efficiency of these models remains bottlenecked by their softmax attention layers. In this paper, we propose Neural Attention Search Linear (NAtS-L), a framework that applies both linear attention and softmax attention operations within the same layer on different tokens. NAtS-L automatically determines whether a token can be handled by a linear attention model, i.e., tokens that have only short-term impact and can be encoded into fixed-size hidden states, or require softmax attention, i.e., tokens that contain information related to long-term retrieval and need to be preserved for future queries. By searching for optimal Gated DeltaNet and softmax attention combinations across tokens, we show that NAtS-L provides a strong yet efficient token-level hybrid architecture.
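The routing idea in the abstract can be sketched in a few lines: each token is either folded into a fixed-size linear-attention state or retained in a softmax KV cache, and every query reads from both. This is a minimal illustrative sketch only; the function and variable names (`hybrid_attention`, `route`) are not from the paper, the routing mask is assumed to be given rather than learned, and a plain additive linear-attention update stands in for Gated DeltaNet.

```python
# Minimal sketch of token-level hybrid attention. Assumption: a precomputed
# boolean `route` per token (True -> retain for softmax attention, False ->
# compress into the linear-attention state). NAtS-L learns this routing and
# uses Gated DeltaNet; here we use a plain additive state update instead.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def hybrid_attention(q, k, v, route):
    """q, k, v: (T, d) arrays; route: (T,) bool. Returns (T, d) outputs."""
    T, d = q.shape
    state = np.zeros((d, d))       # linear-attention state: running sum of k^T v
    kv_keys, kv_vals = [], []      # KV cache for softmax-routed tokens
    out = np.zeros((T, d))
    for t in range(T):
        # Linear branch: read the compressed state with the current query.
        lin = q[t] @ state
        # Softmax branch: attend over retained tokens plus the current one.
        keys = np.array(kv_keys + [k[t]])
        vals = np.array(kv_vals + [v[t]])
        w = softmax((q[t] @ keys.T) / np.sqrt(d))
        out[t] = lin + w @ vals
        if route[t]:
            kv_keys.append(k[t]); kv_vals.append(v[t])   # long-term token
        else:
            state += np.outer(k[t], v[t])                # short-term token
    return out
```

If every token is routed to softmax, the sketch reduces to ordinary causal softmax attention; if none is, the KV cache never grows and memory stays constant in sequence length, which is the efficiency trade-off the paper searches over per token.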
Related papers
- STILL: Selecting Tokens for Intra-Layer Hybrid Attention to Linearize LLMs [23.745366354566315]
Linearizing pretrained large language models (LLMs) primarily relies on intra-layer hybrid attention mechanisms. We propose STILL, an intra-layer hybrid linearization framework for efficiently linearizing LLMs.
arXiv Detail & Related papers (2026-02-02T14:49:18Z) - LINA: Linear Autoregressive Image Generative Models with Continuous Tokens [56.80443965097921]
Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis. We study how to design compute-efficient linear attention within this framework. We present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions.
arXiv Detail & Related papers (2026-01-30T06:44:33Z) - SoLA-Vision: Fine-grained Layer-wise Linear Softmax Hybrid Attention [50.99430451151184]
Linear attention reduces the cost to O(N), yet its compressed state representations can impair modeling capacity and accuracy. We present an analytical study that contrasts linear and softmax attention for visual representation learning. We propose SoLA-Vision, a flexible layer-wise hybrid attention backbone.
arXiv Detail & Related papers (2026-01-16T10:26:53Z) - Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers [36.26426380985327]
Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representations. We introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences.
arXiv Detail & Related papers (2025-12-18T14:53:12Z) - Long-Context Generalization with Sparse Attention [21.400056571592277]
Transformer-based architectures traditionally employ softmax to compute attention weights. As sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show that dynamically sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens.
arXiv Detail & Related papers (2025-06-19T22:43:25Z) - Log-Linear Attention [81.09631871212211]
This paper develops log-linear attention, an attention mechanism that balances linear attention's efficiency and the expressiveness of softmax attention. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants.
arXiv Detail & Related papers (2025-06-05T08:44:51Z) - Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon. We replace softmax with the sigmoid function and combine balanced ALiBi with Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z) - Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
We present two key perspectives to understand and alleviate the limitations of linear attention. We prove that linear attention is not injective, which makes it prone to assigning identical attention weights to different query vectors. We also confirm that effective local modeling is essential to the success of softmax attention, an area in which linear attention falls short.
arXiv Detail & Related papers (2024-12-09T15:44:22Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
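The common thread through the papers above is that linear attention replaces the full KV cache with a fixed-size state. A short numerical sketch (not taken from any of the listed papers) shows the identity this relies on: with a positive feature map phi, the causal parallel form and an O(N) recurrent scan over a state S_t = S_{t-1} + phi(k_t) v_t^T produce the same outputs. The ELU+1 feature map used here is one common illustrative choice, not the papers' specific construction.

```python
# Demonstration that causal linear attention has an equivalent O(N)
# recurrent form with a fixed-size state, instead of a growing KV cache.
import numpy as np

def phi(x):
    # ELU(x) + 1: a simple positive feature map (illustrative choice)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attn_recurrent(q, k, v):
    """O(N) scan: the entire past is compressed into (S, z)."""
    T, d = q.shape
    S = np.zeros((d, d))            # compressed "KV" state: sum of phi(k) v^T
    z = np.zeros(d)                 # running normalizer: sum of phi(k)
    out = np.zeros((T, d))
    for t in range(T):
        S += np.outer(phi(k[t]), v[t])
        z += phi(k[t])
        out[t] = (phi(q[t]) @ S) / (phi(q[t]) @ z)
    return out

def linear_attn_parallel(q, k, v):
    """O(N^2) masked form, mathematically identical to the scan above."""
    Qf, Kf = phi(q), phi(k)
    mask = np.tril(np.ones((len(q), len(q))))   # causal mask
    A = (Qf @ Kf.T) * mask
    return (A / A.sum(axis=1, keepdims=True)) @ v
```

The expressivity limit discussed in the abstract is visible here: the state `S` has a fixed d-by-d size no matter how long the sequence grows, which is exactly why the hybrid approaches above keep softmax attention for tokens that must remain individually retrievable.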
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.