SCOUT: Toward Sub-Quadratic Attention via Segment Compression for Optimized Utility in Transformers
- URL: http://arxiv.org/abs/2509.00935v1
- Date: Sun, 31 Aug 2025 17:08:33 GMT
- Title: SCOUT: Toward Sub-Quadratic Attention via Segment Compression for Optimized Utility in Transformers
- Authors: Aref Jafari, Yuhe Fan, Benyamin Jamialahmadi, Parsa Farinneya, Boxing Chen, Marzieh S. Tahaei,
- Abstract summary: We propose SCOUT, a hybrid architecture that compresses tokens locally within fixed-size segments and applies attention only over these compressed representations.<n>SCOUT retains much of the expressivity of full attention while substantially reducing the computational and memory cost.<n>We analyze SCOUT's computational and memory efficiency and evaluate it empirically on long-context language modeling and reasoning tasks.
- Score: 15.142822497807236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have demonstrated strong performance across a wide range of sequence modeling tasks, but their quadratic attention complexity limits scalability to long sequences. Linear models such as Mamba and sliding-window attention (SWA) address this by mixing tokens through recurrent or localized operations with fixed-size memory, achieving efficient inference. However, these methods risk degrading performance on long sequences due to their inability to retain detailed information from distant tokens. We propose SCOUT (Segment Compression for Optimized Utility in Transformers), a hybrid architecture that compresses tokens locally within fixed-size segments and applies attention only over these compressed representations. Each token embedding is first enriched via a linear local mixer, Mamba or SWA, that integrates recent context. Then, instead of attending to all previous tokens, each token sparsely attends to a small number of compressed checkpoint tokens that summarize the input history. This design retains much of the expressivity of full attention while substantially reducing the computational and memory cost. By attending to compressed history rather than all previous tokens, SCOUT incurs slightly higher memory than purely linear models, but its growth rate remains sub-quadratic and far more scalable than that of full Transformers. We analyze SCOUT's computational and memory efficiency and evaluate it empirically on long-context language modeling and reasoning tasks. SCOUT with both Mamba and SWA mixers outperforms strong long-sequence baselines under the same computational budget, matches full-attention Transformers on language modeling and common-sense reasoning tasks at 400M and 1.3B scales. Moreover, our SCOUT achieves higher end-to-end throughput than SOTA models, while delivering comparable results on long sequence benchmarks.
Related papers
- Memory Caching: RNNs with Growing Memory [56.25483647131372]
We introduce Memory Caching (MC), a technique that enhances recurrent models by caching checkpoints of memory states (a.k.a. hidden states)<n>We propose four variants of MC, including gated aggregation and sparse selective mechanisms, and discuss their implications on both linear and deep memory modules.<n>The results indicate that while Transformers achieve the best accuracy, our MC variants show competitive performance, close the gap with Transformers, and performs better than state-of-the-art recurrent models.
arXiv Detail & Related papers (2026-02-27T18:53:41Z) - Attention and Compression is all you need for Controllably Efficient Language Models [16.42720496730602]
Compress & Attend Transformer (CAT) is a conceptually simple architecture employing dense attention and compression.<n>Cat can be trained with multiple chunk sizes at once, unlocking control of quality-compute trade-offs directly at test-time.<n>A single CAT matches dense transformer in language modeling across model scales while being 1.4-3x faster and requiring 2-9x lower total memory usage.
arXiv Detail & Related papers (2025-11-07T15:13:28Z) - OmniSAT: Compact Action Token, Faster Auto Regression [70.70037017501357]
We introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation.<n>The resulting discrete tokenization shortens the training sequence by 6.8$times$, and lowers the target entropy.
arXiv Detail & Related papers (2025-10-08T03:55:24Z) - Compact Recurrent Transformer with Persistent Memory [16.48606806238812]
The Transformer architecture has shown significant success in many language processing and visual tasks.<n>We propose a novel and efficient Compact Recurrent Transformer (CRT)<n>CRT combines shallow Transformer models that process short local segments with recurrent neural networks to compress and manage a single persistent memory vector.<n>We evaluate CRT on WordPTB and WikiText-103 for next-token-prediction tasks, as well as on the Toyota Smarthome video dataset for classification.
arXiv Detail & Related papers (2025-05-02T00:11:44Z) - Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.<n>This paper first attributes the inefficiency of Transformers to the attention sink phenomenon.<n>We replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z) - Attamba: Attending To Multi-Token States [6.5676809841642125]
We introduce Attamba, a novel architecture that uses state-space models to compress chunks of tokens.
We find that replacing key and value projections in a transformer with SSMs can improve model quality and enable flexible token chunking.
Attamba can perform attention on chunked-sequences of variable length, enabling a smooth transition between quadratic and linear scaling.
arXiv Detail & Related papers (2024-11-26T18:52:06Z) - Taipan: Efficient and Expressive State Space Language Models with Selective Attention [100.16383527459429]
Long-context language modeling is a significant challenge in Natural Language Processing (NLP)
Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval.
We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs)
Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
arXiv Detail & Related papers (2024-10-24T09:25:37Z) - Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.<n>These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators.<n>We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
Self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to complement the lack of long-range dependency issues.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even blocks (Mamba)
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z) - Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.