Mamba Modulation: On the Length Generalization of Mamba
- URL: http://arxiv.org/abs/2509.19633v2
- Date: Fri, 24 Oct 2025 10:19:29 GMT
- Title: Mamba Modulation: On the Length Generalization of Mamba
- Authors: Peng Lu, Jerry Huang, Qiuhao Zeng, Xinyu Wang, Boxing Chen, Philippe Langlais, Yufei Cui,
- Abstract summary: Mamba is a leading architecture for state-space language models.<n>We show that Mamba's performance significantly deteriorates when applied to contexts longer than those seen during pre-training.<n>We propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization.
- Score: 34.91142589654215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behaviour of its state-space dynamics, particularly within the parameterization of the state transition matrix $\mathbf{A}$. Unlike recent works which attribute this sensitivity to the vanished accumulation of discretization time steps, $\exp(-\sum_{t=1}^N\Delta_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $\mathbf{A}$, offering a well-founded explanation of its role in length extension. Next, to overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization by selectively modulating the spectrum of $\mathbf{A}$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating $\Delta_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.
Related papers
- Gather-Scatter Mamba: Accelerating Propagation with Efficient State Space Model [15.551773379039675]
State Space Models (SSMs) have historically played a central role in sequential modeling.<n>Recent advances in selective SSMs like Mamba offer a compelling alternative.<n>We propose a hybrid architecture that combines shifted window self-attention for spatial context aggregation with Mamba-based selective scanning for efficient temporal propagation.
arXiv Detail & Related papers (2025-10-01T13:11:13Z) - Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression [90.93281146423378]
Mamba is an efficient Transformer alternative with linear complexity for long-sequence modeling.<n>Recent empirical works demonstrate Mamba's in-context learning (ICL) competitive with Transformers.<n>This paper studies the training dynamics of Mamba on the linear regression ICL task.
arXiv Detail & Related papers (2025-09-28T09:48:49Z) - Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models [68.31088463716269]
We propose a structured sparse parametrization of transition matrices in state-space models (SSMs)<n>Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$)<n>The model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks.
arXiv Detail & Related papers (2025-09-26T12:46:30Z) - Achilles' Heel of Mamba: Essential difficulties of the Mamba architecture demonstrated by synthetic data [52.07689534063587]
State Space Models (SSMs) have emerged as promising alternatives to attention mechanisms.<n>In this work, we use carefully designed synthetic tasks to reveal Mamba's inherent limitations.
arXiv Detail & Related papers (2025-09-22T08:38:55Z) - Sequential-Parallel Duality in Prefix Scannable Models [68.39855814099997]
Recent developments have given rise to various models, such as Gated Linear Attention (GLA) and Mamba.<n>This raises a natural question: can we characterize the full class of neural sequence models that support near-constant-time parallel evaluation and linear-time, constant-space sequential inference?
arXiv Detail & Related papers (2025-06-12T17:32:02Z) - Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling [50.994194925685434]
LrcSSM is a $textitnon-linear$ recurrent model that processes long sequences as fast as today's linear state-space layers.<n>By forcing its Jacobian matrix to be diagonal, the full sequence can be solved in parallel.<n>LrcSSM offers a formal gradient-stability guarantee that other input-varying systems such as Liquid-S4 do not provide.
arXiv Detail & Related papers (2025-05-27T20:02:59Z) - UmambaTSF: A U-shaped Multi-Scale Long-Term Time Series Forecasting Method Using Mamba [7.594115034632109]
We propose UmambaTSF, a novel long-term time series forecasting framework.
It integrates multi-scale feature extraction capabilities of U-shaped encoder-decoder multilayer perceptrons (MLP) with Mamba's long sequence representation.
UmambaTSF achieves state-of-the-art performance and excellent generality on widely used benchmark datasets.
arXiv Detail & Related papers (2024-10-15T04:56:43Z) - SIGMA: Selective Gated Mamba for Sequential Recommendation [56.85338055215429]
Mamba, a recent advancement, has exhibited exceptional performance in time series prediction.<n>We introduce a new framework named Selective Gated Mamba ( SIGMA) for Sequential Recommendation.<n>Our results indicate that SIGMA outperforms current models on five real-world datasets.
arXiv Detail & Related papers (2024-08-21T09:12:59Z) - SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering [5.016335384639901]
Multi-modal input of Audio-Visual Question Answering (AVQA) makes feature extraction and fusion processes more challenging.
We propose SHMamba: Structured Hyperbolic State Space Model to integrate the advantages of hyperbolic geometry and state space models.
Our method demonstrates superiority among all current major methods and is more suitable for practical application scenarios.
arXiv Detail & Related papers (2024-06-14T08:43:31Z) - Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even blocks (Mamba)
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.