Related papers: SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention

URL: http://arxiv.org/abs/2509.12817v1
Date: Tue, 16 Sep 2025 08:36:05 GMT
Title: SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention
Authors: Yuan Cao, Dong Wang
Abstract summary: We introduce input-adaptive learnable gates to selectively modulate information aggregation into the $KV$ feature map. SAGA achieves a 1.76$\times$ improvement in throughput and a 2.69$\times$ reduction in peak GPU memory compared to PVT-T. It improves top-1 accuracy by up to 4.4% on the ImageNet dataset, demonstrating both computational efficiency and model effectiveness.
Score: 10.607730369798551
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While the Transformer architecture excels at modeling long-range dependencies, contributing to its widespread adoption in vision tasks, the quadratic complexity of softmax-based attention mechanisms imposes a major bottleneck, particularly when processing high-resolution images. Linear attention presents a promising alternative by reformulating the attention computation from $(QK)V$ to $Q(KV)$, thereby reducing the complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ while preserving the global receptive field. However, most existing methods compress historical key-value (KV) information uniformly, which can lead to feature redundancy and the loss of directional alignment with the query (Q). This uniform compression results in low-rank $KV$ feature maps, contributing to a performance gap compared to softmax attention. To mitigate this limitation, we propose Selective Adaptive GAting for Efficient and Expressive Linear Attention (SAGA), which introduces input-adaptive learnable gates to selectively modulate information aggregation into the $KV$ feature map. These gates enhance semantic diversity and alleviate the low-rank constraint inherent in conventional linear attention. Additionally, we propose an efficient Hadamard-product decomposition method for gate computation, which introduces no additional memory overhead. Experiments demonstrate that SAGA achieves a 1.76$\times$ improvement in throughput and a 2.69$\times$ reduction in peak GPU memory compared to PVT-T at a resolution of $1280 \times 1280$. Moreover, it improves top-1 accuracy by up to 4.4% on the ImageNet dataset, demonstrating both computational efficiency and model effectiveness.
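
The reordering from $(QK)V$ to $Q(KV)$ described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`matmul`, `quadratic_attention`, `linear_attention`, `gated_linear_attention`) are hypothetical, the kernel feature map and normalization are omitted, and the gate is simplified to one scalar per token, which is an assumption made only for illustration (the paper's actual gate and its Hadamard-product decomposition are not reproduced here).

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def quadratic_attention(q, k, v):
    """(Q K^T) V: materializes an N x N score matrix (no softmax here)."""
    return matmul(matmul(q, transpose(k)), v)


def linear_attention(q, k, v):
    """Q (K^T V): materializes only a d_k x d_v summary of keys and values."""
    return matmul(q, matmul(transpose(k), v))


def gated_linear_attention(q, k, v, gates):
    """Weight each token's contribution k_i v_i^T by a gate before summing.

    `gates` holds one scalar per token. This scalar form is an assumption
    for illustration, not the gate structure used in the paper.
    """
    d_k, d_v = len(k[0]), len(v[0])
    kv = [[0.0] * d_v for _ in range(d_k)]
    for k_i, v_i, g_i in zip(k, v, gates):
        for a in range(d_k):
            for b in range(d_v):
                kv[a][b] += g_i * k_i[a] * v_i[b]
    return matmul(q, kv)


# Toy inputs: N = 3 tokens, d_k = d_v = 2.
Q = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
K = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0]]
V = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]

out_quadratic = quadratic_attention(Q, K, V)
out_linear = linear_attention(Q, K, V)
out_uniform = gated_linear_attention(Q, K, V, [1.0, 1.0, 1.0])
out_selective = gated_linear_attention(Q, K, V, [1.0, 0.0, 1.0])
```

Without softmax the two orderings give the same result by associativity, and setting every gate to 1 recovers the uniform aggregation; a gate of 0 drops that token from the $KV$ summary entirely.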

Related papers

LINA: Linear Autoregressive Image Generative Models with Continuous Tokens [56.80443965097921]
Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis. We study how to design compute-efficient linear attention within this framework. We present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions.
arXiv Detail & Related papers (2026-01-30T06:44:33Z)
GSPN-2: Efficient Parallel Sequence Modeling [101.33780567131716]
Generalized Spatial Propagation Network (GSPN) addresses this by replacing quadratic self-attention with a line-scan propagation scheme. GSPN-2 establishes a new efficiency frontier for modeling global spatial context in vision applications.
arXiv Detail & Related papers (2025-11-28T07:26:45Z)
VAMO: Efficient Large-Scale Nonconvex Optimization via Adaptive Zeroth Order Variance Reduction [3.130722489512822]
VAMO combines FO mini-batch gradients with ZO finite-difference probes under a ZOG-style framework. VAMO outperforms established FO and ZO methods, offering a faster, more flexible option for improved efficiency.
arXiv Detail & Related papers (2025-05-20T05:31:15Z)
A3: an Analytical Low-Rank Approximation Framework for Attention [14.649496050074735]
We propose A3, a post-training low-rank approximation framework. We show that A3 maintains superior performance compared to SoTAs. We also demonstrate the versatility of A3, including KV cache compression, quantization, and mixed-rank assignments for enhanced performance.
arXiv Detail & Related papers (2025-05-19T10:29:32Z)
Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving [10.835583587146274]
This paper presents PSA, a Progressive Sparse Attention mechanism. It integrates algorithmic innovations with system co-design to achieve both high inference accuracy and improved efficiency in large language models. Experiments demonstrate that PSA reduces KV cache usage for attention computation by up to 2.4$\times$ and 8.8$\times$, and increases end-to-end serving throughput by up to 1.4$\times$ and 2.0$\times$.
arXiv Detail & Related papers (2025-03-01T07:56:42Z)
Order-Optimal Projection-Free Algorithm for Adversarially Constrained Online Convex Optimization [29.705337940879705]
Projection-based algorithms for constrained Online Convex Optimization (COCO) face scalability challenges in high-dimensional settings. This paper introduces a projection-free algorithm for COCO that achieves state-of-the-art performance guarantees while eliminating the need for projections.
arXiv Detail & Related papers (2025-02-23T23:18:40Z)
Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency [52.60557300927007]
We present a MA-OSMA algorithm to transfer the discrete submodular problem into a continuous optimization. We also introduce a projection-free MA-OSEA algorithm, which effectively utilizes the KL divergence by mixing a uniform distribution. Our algorithms significantly improve the $(\frac{1}{1+c})$-approximation provided by the state-of-the-art OSG algorithm.
arXiv Detail & Related papers (2025-02-07T15:57:56Z)
Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers [18.469378618426294]
We introduce Hamming Attention Distillation (HAD), a framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention.
arXiv Detail & Related papers (2025-02-03T19:24:01Z)
CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism. We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies. By adopting a decoupled learning approach and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z)
A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation [66.26739783789387]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB), for reinforcement learning. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and a near-optimal policy switching cost. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
arXiv Detail & Related papers (2023-11-26T08:31:57Z)
Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem. We characterize the implicit bias of 1-layer transformers optimized with gradient descent. We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.