Can Mamba Learn In Context with Outliers? A Theoretical Generalization Analysis
- URL: http://arxiv.org/abs/2510.00399v1
- Date: Wed, 01 Oct 2025 01:25:01 GMT
- Title: Can Mamba Learn In Context with Outliers? A Theoretical Generalization Analysis
- Authors: Hongkang Li, Songtao Lu, Xiaodong Cui, Pin-Yu Chen, Meng Wang
- Abstract summary: The Mamba model has gained significant attention for its computational advantages over Transformer-based models. This paper presents the first theoretical analysis of the training dynamics of a one-layer Mamba model. We show that although Mamba may require more training to converge, it maintains accurate predictions even when the proportion of outliers exceeds the threshold that a linear Transformer can tolerate.
- Score: 88.05636819649804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Mamba model has gained significant attention for its computational advantages over Transformer-based models, while achieving comparable performance across a wide range of language tasks. Like Transformers, Mamba exhibits in-context learning (ICL) capabilities, i.e., making predictions for new tasks based on a prompt containing input-label pairs and a query, without requiring fine-tuning. Despite its empirical success, the theoretical understanding of Mamba remains limited, largely due to the nonlinearity introduced by its gating mechanism. To the best of our knowledge, this paper presents the first theoretical analysis of the training dynamics of a one-layer Mamba model, which consists of a linear attention component followed by a nonlinear gating layer, and its ICL generalization on unseen binary classification tasks, even when the prompt includes additive outliers. Our analysis shows that Mamba leverages the linear attention layer to select informative context examples and uses the nonlinear gating layer to suppress the influence of outliers. By establishing and comparing to the analysis of linear Transformers under the same setting, we show that although Mamba may require more training iterations to converge, it maintains accurate predictions even when the proportion of outliers exceeds the threshold that a linear Transformer can tolerate. These theoretical findings are supported by empirical experiments.
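To make the setting described in the abstract concrete, below is a minimal sketch, under assumed toy parameterization, of a one-layer Mamba-style predictor: a linear attention component over the in-context input-label pairs followed by a nonlinear (sigmoid) gating layer, evaluated on a prompt in which a fraction of the context inputs carries an additive outlier. The function `one_layer_mamba_forward`, the weights `W_attn` and `W_gate`, and the sign readout are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_layer_mamba_forward(prompt_x, prompt_y, query_x, W_attn, W_gate):
    """Hypothetical forward pass: linear attention over the in-context
    examples, followed by an elementwise sigmoid gate.

    prompt_x : (L, d)  context inputs (possibly with additive outliers)
    prompt_y : (L,)    context labels in {-1, +1}
    query_x  : (d,)    query input
    W_attn   : (d, d)  linear-attention weight (illustrative parameterization)
    W_gate   : (d,)    gating weight (illustrative parameterization)
    """
    # Linear attention: unnormalized scores between the query and each
    # context example select which examples contribute to the prediction.
    scores = prompt_x @ W_attn @ query_x            # (L,)
    # Gating: a sigmoid of a linear function of each context input can
    # shrink the contribution of examples corrupted by additive outliers.
    gates = sigmoid(prompt_x @ W_gate)              # (L,)
    # Prediction: gated, attention-weighted vote of the context labels.
    logit = np.sum(gates * scores * prompt_y)
    return np.sign(logit)

# Toy usage: an ICL prompt for binary classification where a fraction of
# the context inputs is corrupted by an additive outlier direction.
rng = np.random.default_rng(0)
d, L = 8, 32
beta = rng.standard_normal(d); beta /= np.linalg.norm(beta)
X = rng.standard_normal((L, d))
y = np.sign(X @ beta)
outlier = 5.0 * rng.standard_normal(d)
X[: L // 4] += outlier                              # 25% of context examples are outliers
x_query = rng.standard_normal(d)
pred = one_layer_mamba_forward(X, y, x_query, np.eye(d), np.zeros(d))
print("predicted label:", pred, " true label:", np.sign(x_query @ beta))
```

With the gate held constant the sketch reduces to a plain linear-attention vote over the context labels; the only point of the gating term is to show where a trained gate could down-weight outlier-corrupted examples, which is the mechanism the abstract attributes to Mamba.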
Related papers
- RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models [8.049668552887505]
Mamba has recently garnered attention as an effective backbone for vision tasks. We make three primary contributions to investigate Mamba's representational properties. Our model achieves a 78.5 percent linear probing accuracy on ImageNet, underscoring its strong performance.
arXiv Detail & Related papers (2025-11-23T09:57:27Z)
- Mamba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning [53.983686308399676]
Mamba is a recently proposed linear-time sequence model with strong empirical performance. We study in-context learning of a single-index model $y \approx g_*(\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle)$. We prove that Mamba, pretrained by gradient-based methods, can achieve efficient ICL via test-time feature learning.
arXiv Detail & Related papers (2025-10-14T00:21:20Z)
- Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression [90.93281146423378]
Mamba is an efficient Transformer alternative with linear complexity for long-sequence modeling. Recent empirical works demonstrate that Mamba's in-context learning (ICL) is competitive with that of Transformers. This paper studies the training dynamics of Mamba on the linear regression ICL task.
arXiv Detail & Related papers (2025-09-28T09:48:49Z)
- From Markov to Laplace: How Mamba In-Context Learns Markov Chains [36.22373318908893]
We study in-context learning on Markov chains and uncover a surprising phenomenon. Unlike transformers, even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator. These theoretical insights align strongly with empirical results and represent the first formal connection between Mamba and optimal statistical estimators.
arXiv Detail & Related papers (2025-02-14T14:13:55Z)
- Rethinking Associative Memory Mechanism in Induction Head [37.93644115914534]
This paper investigates how a two-layer transformer thoroughly captures in-context information and balances it with pretrained bigram knowledge in next-token prediction. We theoretically analyze the representation of weight matrices in attention layers and the resulting logits when a transformer is given prompts generated by a bigram model.
arXiv Detail & Related papers (2024-12-16T05:33:05Z)
- Demystify Mamba in Vision: A Linear Attention Perspective [72.93213667713493]
Mamba is an effective state space model with linear computational complexity. We show that Mamba shares surprising similarities with the linear attention Transformer. We propose a Mamba-Inspired Linear Attention (MILA) model by incorporating the merits of these two key designs into linear attention.
arXiv Detail & Related papers (2024-05-26T15:31:09Z)
- How Do Nonlinear Transformers Learn and Generalize in In-Context Learning? [82.51626700527837]
Transformer-based large language models have displayed impressive in-context learning capabilities, where a pre-trained model can handle new tasks without fine-tuning.
We analyze how the mechanics by which Transformers achieve ICL contribute to the technical challenges of analyzing their training.
arXiv Detail & Related papers (2024-02-23T21:07:20Z)
- Is Mamba Capable of In-Context Learning? [63.682741783013306]
State-of-the-art foundation models such as GPT-4 perform surprisingly well at in-context learning (ICL).
This work provides empirical evidence that Mamba, a newly proposed state space model, has similar ICL capabilities.
arXiv Detail & Related papers (2024-02-05T16:39:12Z)