Related papers: AllMem: A Memory-centric Recipe for Efficient Long-context Modeling

URL: http://arxiv.org/abs/2602.13680v1
Date: Sat, 14 Feb 2026 09:04:28 GMT
Title: AllMem: A Memory-centric Recipe for Efficient Long-context Modeling
Authors: Ziming Wang, Xiang Wang, Kailong Peng, Lang Qin, Juan Gabriel Kostelec, Christos Sourmpis, Axel Laborieux, Qinghai Guo
Abstract summary: Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks. We introduce AllMem, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks.
Score: 32.025154452526856
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks due to the computational complexity and memory overhead inherent in the self-attention mechanism. To address these challenges, we introduce AllMem, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks. AllMem enables models to effectively scale to ultra-long contexts while mitigating catastrophic forgetting. This approach not only overcomes the representation constraints typical of linear memory models but also significantly reduces the computational and memory footprint during long-sequence inference. Furthermore, we implement a Memory-Efficient Fine-Tuning strategy to replace standard attention layers in pre-trained models with memory-augmented sliding window layers. This framework facilitates the efficient transformation of any off-the-shelf pre-trained LLM into an AllMem-based architecture. Empirical evaluations confirm that our 4k window model achieves near-lossless performance on 37k LongBench with a marginal 0.83 drop compared to full attention. Furthermore, on InfiniteBench at a 128k context, our 8k window variant outperforms full attention, which validates the effectiveness of our parameterized memory in mitigating noise and maintaining robust long-range modeling without the prohibitive costs of global attention.
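The Sliding Window Attention component mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the AllMem implementation: it shows only a generic single-head causal attention restricted to the most recent `window` positions, and it omits the TTT memory network entirely. The function names `sliding_window_mask` and `sliding_window_attention` are hypothetical and chosen for illustration.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if query i may attend to key j.

    A query sees itself and at most the (window - 1) positions before it.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def sliding_window_attention(q, k, v, window):
    """Single-head causal attention limited to the last `window` positions.

    q, k, v are arrays of shape (seq_len, dim). Illustrative sketch only;
    a real model would add heads, batching, and a rolling KV cache.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    mask = sliding_window_mask(q.shape[0], window)
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because each query attends to at most `window` keys, the per-token cost in this sketch depends on the window size rather than on the full sequence length, which is the general motivation for pairing a local window with a separate long-range memory.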

Related papers

LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory [97.14005794889134]
We present LoGeR, a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. This memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference.
arXiv Detail & Related papers (2026-03-03T18:55:37Z)
MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling [80.48332380100915]
MiniCPM-SALA is a hybrid model that integrates the high-fidelity long-context modeling of sparse attention with the global efficiency of linear attention. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens.
arXiv Detail & Related papers (2026-02-12T09:37:05Z)
Mosaic: Unlocking Long-Context Inference for Diffusion LLMs via Global Memory Planning and Dynamic Peak Taming [34.16016695663811]
Diffusion-based large language models (dLLMs) have emerged as a promising paradigm, utilizing simultaneous denoising to enable global planning and iterative refinement. Existing inference systems are ill-suited for this paradigm due to severe system inefficiencies. We propose Mosaic, a memory-efficient inference system that shifts from local, static management to a global, dynamic paradigm.
arXiv Detail & Related papers (2026-01-10T13:17:08Z)
MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs [9.086910335841772]
The "memory wall" bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures. We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach. We demonstrate that MoEBlaze can achieve over 4x speedups and over 50% memory savings compared to existing MoE frameworks.
arXiv Detail & Related papers (2026-01-08T08:38:23Z)
Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference [16.71963410333802]
Large language models (LLMs) have demonstrated remarkable success across diverse artificial intelligence tasks. MoC substantially reduces activation memory during pre-training. MoC delivers significant memory savings and throughput gains while maintaining competitive model performance.
arXiv Detail & Related papers (2025-11-12T13:30:57Z)
Artificial Hippocampus Networks for Efficient Long-Context Modeling [17.23148291364832]
Long-sequence modeling faces a trade-off between the efficiency of compressive fixed-size memory in RNN-like models and the fidelity of growing memory in attention-based Transformers. Inspired by the Multi-Store Model in cognitive science, we introduce a memory framework of artificial neural networks. Experiments on long-context benchmarks LV-Eval and InfiniteBench demonstrate that AHN-augmented models consistently outperform sliding window baselines.
arXiv Detail & Related papers (2025-10-08T17:59:55Z)
iFlame: Interleaving Full and Linear Attention for Efficient Mesh Generation [49.8026360054331]
iFlame is a novel transformer-based network architecture for mesh generation. We propose an interleaving autoregressive mesh generation framework that combines the efficiency of linear attention with the expressive power of full attention mechanisms. Our results indicate that the proposed interleaving framework effectively balances computational efficiency and generative performance.
arXiv Detail & Related papers (2025-03-20T19:10:37Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling. Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.