AllMem: A Memory-centric Recipe for Efficient Long-context Modeling
- URL: http://arxiv.org/abs/2602.13680v1
- Date: Sat, 14 Feb 2026 09:04:28 GMT
- Title: AllMem: A Memory-centric Recipe for Efficient Long-context Modeling
- Authors: Ziming Wang, Xiang Wang, Kailong Peng, Lang Qin, Juan Gabriel Kostelec, Christos Sourmpis, Axel Laborieux, Qinghai Guo,
- Abstract summary: Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks.<n>We introduce textscAllMem, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks.
- Score: 32.025154452526856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks due to the computational complexity and memory overhead inherent in the self-attention mechanism. To address these challenges, we introduce \textsc{AllMem}, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks. \textsc{AllMem} enables models to effectively scale to ultra-long contexts while mitigating catastrophic forgetting. This approach not only overcomes the representation constraints typical of linear memory models but also significantly reduces the computational and memory footprint during long-sequence inference. Furthermore, we implement a Memory-Efficient Fine-Tuning strategy to replace standard attention layers in pre-trained models with memory-augmented sliding window layers. This framework facilitates the efficient transformation of any off-the-shelf pre-trained LLM into an \textsc{AllMem}-based architecture. Empirical evaluations confirm that our 4k window model achieves near-lossless performance on 37k LongBench with a marginal 0.83 drop compared to full attention. Furthermore, on InfiniteBench at a 128k context, our 8k window variant outperforms full attention, which validates the effectiveness of our parameterized memory in mitigating noise and maintaining robust long-range modeling without the prohibitive costs of global attention.
Related papers
- LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory [97.14005794889134]
We present LoGeR, a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization.<n>LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning.<n>This memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference.
arXiv Detail & Related papers (2026-03-03T18:55:37Z) - MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling [80.48332380100915]
MiniCPM-SALA is a hybrid model that integrates the high-fidelity long-context modeling of sparse attention with the global efficiency of linear attention.<n>On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens.
arXiv Detail & Related papers (2026-02-12T09:37:05Z) - Mosaic: Unlocking Long-Context Inference for Diffusion LLMs via Global Memory Planning and Dynamic Peak Taming [34.16016695663811]
Diffusion-based large language models (dLLMs) have emerged as a promising paradigm, utilizing simultaneous denoising to enable global planning and iterative refinement.<n>Existing inference systems are ill-suited for this paradigm due to severe system inefficiencies.<n>We propose Mosaic, a memory-efficient inference system that shifts from local, static management to a global, dynamic paradigm.
arXiv Detail & Related papers (2026-01-10T13:17:08Z) - MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs [9.086910335841772]
"Memory wall" bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures.<n>We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach.<n>We demonstrate that MoEBlaze can achieve over 4x speedups and over 50% memory savings compared to existing MoE frameworks.
arXiv Detail & Related papers (2026-01-08T08:38:23Z) - Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference [16.71963410333802]
Large language models (LLMs) have demonstrated remarkable success across diverse artificial intelligence tasks.<n>MoC substantially reduces activation memory during pre-training.<n>MoC delivers significant memory savings and throughput gains while maintaining competitive model performance.
arXiv Detail & Related papers (2025-11-12T13:30:57Z) - Artificial Hippocampus Networks for Efficient Long-Context Modeling [17.23148291364832]
Long-sequence modeling faces a trade-off between the efficiency of compressive fixed-size memory in RNN-like models and the fidelity of growing memory in attention-based Transformers.<n>Inspired by the Multi-Store Model in cognitive science, we introduce a memory framework of artificial neural networks.<n>Experiments on long-context benchmarks LV-Eval and InfiniteBench demonstrate that AHN-augmented models consistently outperform sliding window baselines.
arXiv Detail & Related papers (2025-10-08T17:59:55Z) - iFlame: Interleaving Full and Linear Attention for Efficient Mesh Generation [49.8026360054331]
iFlame is a novel transformer-based network architecture for mesh generation.<n>We propose an interleaving autoregressive mesh generation framework that combines the efficiency of linear attention with the expressive power of full attention mechanisms.<n>Our results indicate that the proposed interleaving framework effectively balances computational efficiency and generative performance.
arXiv Detail & Related papers (2025-03-20T19:10:37Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.<n>This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.<n>We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.