Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons
- URL: http://arxiv.org/abs/2506.01963v1
- Date: Fri, 09 May 2025 00:25:46 GMT
- Title: Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons
- Authors: Andrew Kiruluta, Preethi Raju, Priscilla Burity
- Abstract summary: We present a novel non-attention-based architecture for large language models (LLMs) that efficiently handles very long context windows. Unlike traditional Transformer designs, which suffer from quadratic memory and compute overhead due to the nature of the self-attention mechanism, our model avoids token-to-token attention entirely.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel non-attention-based architecture for large language models (LLMs) that efficiently handles very long context windows, on the order of hundreds of thousands to potentially millions of tokens. Unlike traditional Transformer designs, which suffer from quadratic memory and computational overhead due to the nature of the self-attention mechanism, our model avoids token-to-token attention entirely. Instead, it combines the following complementary components: State Space blocks (inspired by S4) that learn continuous-time convolution kernels and scale near-linearly with sequence length, Multi-Resolution Convolution layers that capture local context at different dilation levels, a lightweight Recurrent Supervisor that maintains a global hidden state across sequential chunks, and Retrieval-Augmented External Memory that stores and retrieves high-level chunk embeddings without reintroducing quadratic operations.
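The following is a minimal, self-contained PyTorch sketch of how these four components could be wired together. It is an illustrative approximation, not the authors' implementation: the FFT-based long convolution stands in for a full S4 parameterization, and the module names, dimensions, mean-pooled chunk summaries, and cosine-similarity retrieval are all assumptions introduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LongConvBlock(nn.Module):
    """Stand-in for an S4-style block: a learned global convolution applied via
    FFT, costing O(L log L) in sequence length L. (Real S4 derives the kernel
    from a continuous-time state-space model rather than learning it directly.)"""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.kernel = nn.Parameter(0.02 * torch.randn(d_model, max_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, L, D)
        L = x.size(1)
        n = 2 * L                                            # zero-pad -> linear (causal) conv
        X = torch.fft.rfft(x.transpose(1, 2), n=n)           # (B, D, n//2 + 1)
        K = torch.fft.rfft(self.kernel[:, :L], n=n)          # (D, n//2 + 1)
        y = torch.fft.irfft(X * K, n=n)[..., :L]             # (B, D, L)
        return y.transpose(1, 2) + x                         # residual connection


class MultiResConv(nn.Module):
    """Depthwise convolutions at several dilation levels capture local context."""

    def __init__(self, d_model: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(d_model, d_model, kernel_size=3,
                      padding=d, dilation=d, groups=d_model)
            for d in dilations
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # (B, L, D)
        h = x.transpose(1, 2)                                 # (B, D, L)
        out = sum(conv(h) for conv in self.convs) / len(self.convs)
        return out.transpose(1, 2) + x


class NonAttentionLM(nn.Module):
    """Chunked backbone with no token-to-token attention anywhere. A GRU cell
    acts as the recurrent supervisor carrying a global state across chunks,
    and a list of mean-pooled chunk embeddings serves as the external memory
    queried by top-k cosine similarity."""

    def __init__(self, vocab_size: int, d_model: int = 256,
                 chunk_len: int = 1024, top_k: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.ssm = LongConvBlock(d_model, max_len=chunk_len)
        self.local = MultiResConv(d_model)
        self.supervisor = nn.GRUCell(d_model, d_model)
        self.head = nn.Linear(d_model, vocab_size)
        self.chunk_len, self.top_k = chunk_len, top_k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, T)
        B = tokens.size(0)
        state = torch.zeros(B, self.embed.embedding_dim, device=tokens.device)
        memory = []                                   # stored (B, D) chunk embeddings
        logits = []
        for start in range(0, tokens.size(1), self.chunk_len):
            x = self.embed(tokens[:, start:start + self.chunk_len])   # (B, L, D)
            h = self.local(self.ssm(x))
            summary = h.mean(dim=1)                                   # (B, D)
            if memory:                                # retrieve top-k past chunks
                bank = torch.stack(memory, dim=1)                     # (B, M, D)
                sims = F.cosine_similarity(bank, summary.unsqueeze(1), dim=-1)
                idx = sims.topk(min(self.top_k, bank.size(1)), dim=1).indices
                gathered = bank.gather(
                    1, idx.unsqueeze(-1).expand(-1, -1, bank.size(-1)))
                summary = summary + gathered.mean(dim=1)
            state = self.supervisor(summary, state)   # global hidden state update
            memory.append(summary.detach())
            logits.append(self.head(h + state.unsqueeze(1)))
        return torch.cat(logits, dim=1)               # (B, T, vocab_size)
```

Calling this model on a (batch, length) tensor of token ids returns per-token logits while only ever materializing per-chunk activations plus the small memory bank; chunk size and retrieval depth are tunable design choices, and no step computes a token-to-token attention matrix.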
Related papers
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [58.044803442346115]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive computational complexity and memory overhead during inference. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching.
arXiv Detail & Related papers (2025-08-04T16:14:03Z)
- Recurrent Memory-Augmented Transformers with Chunked Attention for Long-Context Language Modeling [0.0]
We present a Transformer architecture for language modeling that combines global attention with two biologically inspired components. This unified attention block allows the model to efficiently handle both short-range and long-range dependencies. The architecture is implemented entirely from scratch in PyTorch, with no reliance on high-level libraries.
arXiv Detail & Related papers (2025-07-01T06:11:38Z)
- Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System [8.629870144131248]
Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes. We introduce Mem4Nav, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone.
arXiv Detail & Related papers (2025-06-24T09:00:43Z)
- Long-Context State-Space Video World Models [66.28743632951218]
We propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory.
arXiv Detail & Related papers (2025-05-26T16:12:41Z)
- Star Attention: Efficient LLM Inference over Long Sequences [17.401430615714]
We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts. Star Attention integrates seamlessly with most Transformer-based Large Language Models trained with global attention, reducing memory requirements and inference time by up to 11x.
arXiv Detail & Related papers (2024-11-26T05:10:04Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ that compresses the global abstraction into a fixed-length codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps compensate for the lack of long-range dependency modeling.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of pre-trained large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging step, keeping memory usage efficient; a minimal sketch of this chunk-and-merge pattern appears after this list.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
- Landmark Attention: Random-Access Infinite Context Length for Transformers [45.69864961773124]
We present a novel approach that allows access to the complete context while retaining random-access flexibility.
Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks.
We demonstrate that our method can achieve performance comparable to Transformer-XL while significantly reducing the number of retrieved tokens in each step.
arXiv Detail & Related papers (2023-05-25T17:53:42Z)
- Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer that performs attention across chunked sequences.
The proposed framework pivots on two unique types of Transformer layer: the Sliding-Window Layer and the Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
arXiv Detail & Related papers (2020-09-13T22:09:30Z)
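As referenced in the HOMER entry above, here is a minimal sketch of the chunk-and-merge pattern it describes: split a long sequence of token embeddings into chunks, drop the lowest-scoring tokens before each merge, and merge adjacent chunks until one remains. The norm-based importance score, the fixed keep ratio, and the function names are assumptions for illustration; HOMER itself performs the reduction inside a pre-trained LLM rather than on raw embeddings.

```python
import torch


def reduce_tokens(chunk: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the highest-scoring tokens of a (L, D) chunk, preserving order.
    The vector-norm score is a hypothetical stand-in for a learned importance."""
    scores = chunk.norm(dim=-1)                        # proxy importance score per token
    k = max(1, int(chunk.size(0) * keep_ratio))
    idx = scores.topk(k).indices.sort().values         # keep original token order
    return chunk[idx]


def hierarchical_merge(x: torch.Tensor, chunk_len: int = 512) -> torch.Tensor:
    """x: (T, D) token embeddings -> one compressed chunk via pairwise merging."""
    chunks = list(x.split(chunk_len, dim=0))
    while len(chunks) > 1:
        merged = []
        for i in range(0, len(chunks), 2):
            pair = [reduce_tokens(c) for c in chunks[i:i + 2]]   # reduction precedes merging
            merged.append(torch.cat(pair, dim=0))
        chunks = merged
    return chunks[0]


# e.g. hierarchical_merge(torch.randn(8192, 768)).shape -> torch.Size([512, 768])
```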