MemMamba: Rethinking Memory Patterns in State Space Model
- URL: http://arxiv.org/abs/2510.03279v1
- Date: Sun, 28 Sep 2025 14:40:58 GMT
- Title: MemMamba: Rethinking Memory Patterns in State Space Model
- Authors: Youjin Wang, Yangjingyi Chen, Jiahao Yan, Jiaxuan Lu, Xiao Sun
- Abstract summary: We show that selective state-space models such as Mamba have high efficiency with O(n) time and O(1) recurrent inference, yet their long-range memory decays exponentially. Inspired by how humans distill and retain salient information when reading long documents, we propose MemMamba. MemMamba achieves significant improvements over existing Mamba variants and Transformers on long-sequence benchmarks.
- Score: 6.537535831000493
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the explosive growth of data, long-sequence modeling has become increasingly important in tasks such as natural language processing and bioinformatics. However, existing methods face inherent trade-offs between efficiency and memory. Recurrent neural networks suffer from gradient vanishing and explosion, making them hard to scale. Transformers can model global dependencies but are constrained by quadratic complexity. Recently, selective state-space models such as Mamba have demonstrated high efficiency with O(n) time and O(1) recurrent inference, yet their long-range memory decays exponentially. In this work, we conduct mathematical derivations and information-theoretic analysis to systematically uncover the memory decay mechanism of Mamba, answering a fundamental question: what is the nature of Mamba's long-range memory and how does it retain information? To quantify key information loss, we further introduce horizontal-vertical memory fidelity metrics that capture degradation both within and across layers. Inspired by how humans distill and retain salient information when reading long documents, we propose MemMamba, a novel architectural framework that integrates state summarization mechanism together with cross-layer and cross-token attention, which alleviates long-range forgetting while preserving linear complexity. MemMamba achieves significant improvements over existing Mamba variants and Transformers on long-sequence benchmarks such as PG19 and Passkey Retrieval, while delivering a 48% speedup in inference efficiency. Both theoretical analysis and empirical results demonstrate that MemMamba achieves a breakthrough in the complexity-memory trade-off, offering a new paradigm for ultra-long sequence modeling.
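The abstract's central claim, that a linear recurrence with O(1) state forgets early tokens exponentially fast, can be illustrated with a toy model. The sketch below is an assumption-laden simplification (a scalar state with a fixed decay factor `a`, not the authors' MemMamba or the actual Mamba parameterization): under the recurrence h_t = a·h_{t-1} + b·x_t with |a| < 1, the first token's weight in the final state shrinks geometrically with sequence length.

```python
# Toy illustration (NOT the MemMamba implementation) of exponential memory
# decay in a linear state-space recurrence h_t = a * h_{t-1} + b * x_t.
# `a` and `b` here are hypothetical fixed scalars; real selective SSMs such
# as Mamba make them input-dependent, but the decay mechanism is the same
# whenever the recurrent multiplier stays below 1 in magnitude.

def state_update(h: float, x: float, a: float = 0.9, b: float = 1.0) -> float:
    """One recurrence step: O(1) memory, O(n) total time over a sequence."""
    return a * h + b * x

def first_token_weight(seq_len: int, a: float = 0.9) -> float:
    """Weight of token x_1 remaining in the state after seq_len steps: a**(seq_len-1)."""
    return a ** (seq_len - 1)

# The first token's influence collapses as the sequence grows:
print(first_token_weight(10))    # ~0.387
print(first_token_weight(1000))  # effectively zero
```

MemMamba's state summarization and cross-layer/cross-token attention are, per the abstract, aimed at counteracting exactly this geometric vanishing while keeping the recurrence linear in sequence length.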
Related papers
- From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents [78.30630000529133]
We propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic. Experiments confirm the effectiveness of MM-Mem on both offline and streaming tasks.
arXiv Detail & Related papers (2026-03-02T05:12:45Z) - Memory Caching: RNNs with Growing Memory [56.25483647131372]
We introduce Memory Caching (MC), a technique that enhances recurrent models by caching checkpoints of memory states (a.k.a. hidden states). We propose four variants of MC, including gated aggregation and sparse selective mechanisms, and discuss their implications on both linear and deep memory modules. The results indicate that while Transformers achieve the best accuracy, our MC variants show competitive performance, close the gap with Transformers, and perform better than state-of-the-art recurrent models.
arXiv Detail & Related papers (2026-02-27T18:53:41Z) - MambaMIL+: Modeling Long-Term Contextual Patterns for Gigapixel Whole Slide Image [24.093388981091717]
Multiple instance learning (MIL) offers a solution by treating each WSI as a bag of patch-level instances. Mamba has emerged as a promising alternative for long sequence learning, scaling linearly to thousands of tokens. We propose MambaMIL+, a new MIL framework that explicitly integrates spatial context while maintaining long-range dependency modeling.
arXiv Detail & Related papers (2025-12-19T16:01:14Z) - Language Modeling With Factorization Memory [1.9538130634206368]
We propose Factorization Memory, an efficient recurrent neural network (RNN) architecture that achieves performance comparable to Transformer models on short-context language modeling tasks. We develop a sparse formulation of Factorization Memory that updates only a subset of recurrent states at each step while preserving the strong performance of its dense counterpart.
arXiv Detail & Related papers (2025-10-31T23:27:11Z) - ResFormer: All-Time Reservoir Memory for Long Sequence Classification [4.298381633106637]
Sequence classification is essential in NLP for understanding and categorizing language patterns in tasks like sentiment analysis, intent detection, and topic classification. Transformer-based models, despite achieving state-of-the-art performance, have inherent limitations due to quadratic time and memory complexity. We propose ResFormer, a novel neural network architecture designed to model varying context lengths efficiently through a cascaded methodology.
arXiv Detail & Related papers (2025-09-28T21:20:49Z) - BrainMT: A Hybrid Mamba-Transformer Architecture for Modeling Long-Range Dependencies in Functional MRI Data [0.09363323206192666]
Recent advances in deep learning have made it possible to predict phenotypic measures directly from functional magnetic resonance imaging (fMRI) brain volumes. We introduce BrainMT, a novel hybrid framework designed to efficiently learn and integrate long-range spatiotemporal attributes in fMRI data. Our framework operates in two stages: (1) a bidirectional Mamba block with a temporal-first scanning mechanism to capture global temporal interactions in a computationally efficient manner; and (2) a transformer block leveraging self-attention to model global spatial relationships.
arXiv Detail & Related papers (2025-06-27T19:20:41Z) - Emergence of Primacy and Recency Effect in Mamba: A Mechanistic Point of View [16.8179962093575]
We study memory in state-space language models using primacy and recency effects as behavioral tools to uncover how information is retained and forgotten over time. Applying structured recall tasks to the Mamba architecture, we observe a consistent U-shaped accuracy profile, indicating strong performance at the beginning and end of input sequences.
arXiv Detail & Related papers (2025-06-18T06:02:02Z) - Non-Markovianity and memory enhancement in Quantum Reservoir Computing [0.8437187555622164]
We show that non-Markovian dynamics can overcome this limitation, enabling extended memory retention. We introduce an embedding approach that allows a controlled transition from Markovian to non-Markovian evolution. Our results establish quantum non-Markovianity as a key resource for enhancing memory in quantum machine learning architectures.
arXiv Detail & Related papers (2025-05-05T09:17:08Z) - Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning [64.93848182403116]
Current deep-learning memory models struggle in reinforcement learning environments that are partially observable and long-term.
We introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents.
Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks.
arXiv Detail & Related papers (2024-10-14T03:50:17Z) - DocMamba: Efficient Document Pre-training with State Space Model [56.84200017560988]
We present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. Experiments on the HRDoc dataset confirm DocMamba's potential for length extrapolation.
arXiv Detail & Related papers (2024-09-18T11:34:28Z) - B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory [91.81390121042192]
We develop a class of models called B'MOJO to seamlessly combine eidetic and fading memory within a composable module.
B'MOJO's ability to modulate eidetic and fading memory results in better inference on longer sequences tested up to 32K tokens.
arXiv Detail & Related papers (2024-07-08T18:41:01Z) - Recurrent Action Transformer with Memory [39.58317527488534]
This paper proposes a novel model architecture that incorporates a recurrent memory mechanism designed to regulate information retention.
We conduct experiments on memory-intensive environments (ViZDoom-Two-Colors, T-Maze, Memory Maze, Minigrid-Memory), classic Atari games, and MuJoCo control environments.
The results show that using memory can significantly improve performance in memory-intensive environments, while maintaining or improving results in classic environments.
arXiv Detail & Related papers (2023-06-15T19:29:08Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.