SoLA-Vision: Fine-grained Layer-wise Linear Softmax Hybrid Attention
- URL: http://arxiv.org/abs/2601.11164v1
- Date: Fri, 16 Jan 2026 10:26:53 GMT
- Title: SoLA-Vision: Fine-grained Layer-wise Linear Softmax Hybrid Attention
- Authors: Ruibang Li, Guan Luo, Yiwei Zhang, Jin Gao, Bing Li, Weiming Hu,
- Abstract summary: Linear attention reduces the cost to O(N), yet its compressed state representations can impair modeling capacity and accuracy. We present an analytical study that contrasts linear and softmax attention for visual representation learning. We propose SoLA-Vision, a flexible layer-wise hybrid attention backbone.
- Score: 50.99430451151184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Standard softmax self-attention excels in vision tasks but incurs quadratic complexity O(N^2), limiting high-resolution deployment. Linear attention reduces the cost to O(N), yet its compressed state representations can impair modeling capacity and accuracy. We present an analytical study that contrasts linear and softmax attention for visual representation learning from a layer-stacking perspective. We further conduct systematic experiments on layer-wise hybridization patterns of linear and softmax attention. Our results show that, compared with rigid intra-block hybrid designs, fine-grained layer-wise hybridization can match or surpass performance while requiring fewer softmax layers. Building on these findings, we propose SoLA-Vision (Softmax-Linear Attention Vision), a flexible layer-wise hybrid attention backbone that enables fine-grained control over how linear and softmax attention are integrated. By strategically inserting a small number of global softmax layers, SoLA-Vision achieves a strong trade-off between accuracy and computational cost. On ImageNet-1K, SoLA-Vision outperforms purely linear and other hybrid attention models. On dense prediction tasks, it consistently surpasses strong baselines by a considerable margin. Code will be released.
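The complexity contrast at the heart of the abstract can be made concrete with a minimal NumPy sketch (an illustrative assumption, not the paper's implementation): softmax attention materializes an N x N score matrix, while kernelized linear attention folds keys and values into a d x d state first, so its cost grows linearly in the token count N. The feature map `phi` below (ReLU plus a small offset) is a common illustrative choice, not the specific map used by SoLA-Vision.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard softmax attention: materializes the full N x N score
    # matrix, so time and memory scale as O(N^2) in the token count N.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernelized linear attention: accumulate K^T V into a d x d
    # "compressed state" first, so cost scales as O(N d^2), linear in N.
    Qp, Kp = phi(Q), phi(K)
    state = Kp.T @ V                 # (d, d) compressed state
    norm = Qp @ Kp.sum(axis=0)       # (N,) per-token normalizer
    return (Qp @ state) / norm[:, None]

rng = np.random.default_rng(0)
N, d = 16, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out_soft = softmax_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
assert out_soft.shape == out_lin.shape == (N, d)
```

A layer-wise hybrid in the paper's sense would stack mostly `linear_attention` layers and insert a small number of `softmax_attention` layers at chosen depths, trading a little O(N^2) cost for restored global modeling capacity.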
Related papers
- Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models [7.961563754693873]
We propose a framework that applies both linear attention and softmax attention operations within the same layer on different tokens. NAtS-L automatically determines whether a token can be handled by a linear attention model, i.e., tokens that have only short-term impact. By searching for optimal Gated DeltaNet and softmax attention combinations across tokens, we show that NAtS-L provides a strong yet efficient token-level hybrid architecture.
arXiv Detail & Related papers (2026-02-03T16:02:50Z)
- Softmax Linear Attention: Reclaiming Global Competition [28.81301173774774]
We propose Softmax Linear Attention (SLA), a framework designed to restore competitive selection without sacrificing efficiency. Experiments demonstrate SLA consistently enhances state-of-the-art linear baselines across language modeling and long-context benchmarks.
arXiv Detail & Related papers (2026-02-02T07:25:03Z)
- Rectifying Magnitude Neglect in Linear Attention [57.097694292570885]
Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. We propose Magnitude-Aware Linear Attention (MALA), which modifies the computation of Linear Attention to fully incorporate the Query's magnitude.
arXiv Detail & Related papers (2025-07-01T11:49:05Z)
- Gating is Weighting: Understanding Gated Linear Attention through In-context Learning [48.90556054777393]
Gated Linear Attention (GLA) architectures include competitive models such as Mamba and RWKV. We show that a multilayer GLA can implement a general class of Weighted Preconditioned Gradient Descent (WPGD) algorithms. Under mild conditions, we establish the existence and uniqueness (up to scaling) of a global minimum, corresponding to a unique WPGD solution.
arXiv Detail & Related papers (2025-04-06T00:37:36Z)
- Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
We present two key perspectives to understand and alleviate the limitations of linear attention. First, we prove that linear attention is not injective, which makes it prone to assigning identical attention weights to different query vectors. Secondly, we confirm that effective local modeling is essential for the success of Softmax attention, in which linear attention falls short.
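The non-injectivity claim above can be seen directly: with a positive-homogeneous feature map, rescaling a query leaves the normalized attention distribution unchanged, so distinct queries collapse onto identical weights, which is also the magnitude neglect that MALA targets. A minimal NumPy sketch under that assumption (the ReLU feature map is illustrative, not the papers' exact choice):

```python
import numpy as np

def linear_attn_weights(q, K, phi=lambda x: np.maximum(x, 0.0)):
    # Normalized linear-attention weights for a single query q:
    # w_i = phi(q).phi(k_i) / sum_j phi(q).phi(k_j)
    s = phi(q) @ phi(K).T
    return s / s.sum()

rng = np.random.default_rng(1)
q = np.abs(rng.standard_normal(8))        # positive, so phi(q) = q
K = np.abs(rng.standard_normal((5, 8)))

w1 = linear_attn_weights(q, K)
w2 = linear_attn_weights(3.0 * q, K)      # a different query vector

# Distinct queries, identical attention distribution: the query's
# magnitude cancels in the normalization, so the map is not injective.
assert np.allclose(w1, w2)
```

Softmax attention does not have this degeneracy: scaling a query sharpens or flattens its softmax distribution, which is one reason a few global softmax layers can restore selectivity in a mostly-linear stack.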
arXiv Detail & Related papers (2024-12-09T15:44:22Z)
- Breaking the Low-Rank Dilemma of Linear Attention [61.55583836370135]
Linear attention provides a far more efficient solution by reducing the complexity to linear levels. Our experiments indicate that this performance drop is due to the low-rank nature of linear attention's feature map. We introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency.
arXiv Detail & Related papers (2024-11-12T08:30:59Z)
- Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention [100.81495948184649]
We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text.
Our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models.
arXiv Detail & Related papers (2022-11-21T18:22:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.