Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression
- URL: http://arxiv.org/abs/2509.23779v1
- Date: Sun, 28 Sep 2025 09:48:49 GMT
- Title: Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression
- Authors: Jiarui Jiang, Wei Huang, Miao Zhang, Taiji Suzuki, Liqiang Nie
- Abstract summary: Mamba is an efficient Transformer alternative with linear complexity for long-sequence modeling. Recent empirical works demonstrate that Mamba's in-context learning (ICL) capabilities are competitive with Transformers'. This paper studies the training dynamics of Mamba on the linear regression ICL task.
- Score: 90.93281146423378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-space models (SSMs), particularly Mamba, have emerged as an efficient Transformer alternative with linear complexity for long-sequence modeling. Recent empirical works demonstrate that Mamba's in-context learning (ICL) capabilities are competitive with Transformers', a critical capacity for large foundation models. However, theoretical understanding of Mamba's ICL remains limited, restricting deeper insights into its underlying mechanisms. Even fundamental tasks such as linear regression ICL, widely studied as a standard theoretical benchmark for Transformers, have not been thoroughly analyzed in the context of Mamba. To address this gap, we study the training dynamics of Mamba on the linear regression ICL task. By developing novel techniques for tackling the non-convex gradient-descent optimization arising from Mamba's structure, we establish an exponential convergence rate to the ICL solution, and derive a loss bound comparable to the Transformer's. Importantly, our results reveal that Mamba can perform a variant of *online gradient descent* to learn the latent function in context. This mechanism differs from that of the Transformer, which is typically understood to achieve ICL through gradient descent emulation. The theoretical results are verified by experimental simulation.
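To make the claimed mechanism concrete, below is a minimal sketch (not the authors' construction) of online gradient descent on an in-context linear regression prompt: the learner consumes the demonstration pairs (x_i, y_i) one at a time, takes a single gradient step on the squared loss per pair, and then predicts on the query. The dimension, prompt length, step size, and data distribution are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_context, eta = 8, 64, 0.1  # dimension, prompt length, step size (all illustrative)

# Latent task: y_i = <beta, x_i>, the linear regression ICL setting.
beta = rng.normal(size=d)
X = rng.normal(size=(n_context, d)) / np.sqrt(d)
y = X @ beta

# Online gradient descent over the prompt: one step per in-context example,
# never revisiting earlier pairs.
w = np.zeros(d)
for x_i, y_i in zip(X, y):
    w -= eta * (w @ x_i - y_i) * x_i  # gradient of 0.5 * (w @ x_i - y_i)**2

# Predict on a fresh query drawn from the same latent task.
x_query = rng.normal(size=d) / np.sqrt(d)
print(f"prediction: {w @ x_query:.4f}  target: {beta @ x_query:.4f}")
```

In this view the state carried across the sequence plays the role of the running weight vector w, updated once per demonstration rather than by re-solving the regression over the whole prefix.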
Related papers
- Transformer-Progressive Mamba Network for Lightweight Image Super-Resolution [45.74812546007778]
Mamba-based super-resolution (SR) methods have demonstrated the ability to capture global receptive fields with linear complexity. We propose T-PMambaSR, a lightweight SR framework that integrates window-based self-attention with Progressive Mamba.
arXiv Detail & Related papers (2025-11-05T06:46:17Z) - Mamba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning [53.983686308399676]
Mamba is a recently proposed linear-time sequence model with strong empirical performance. We study in-context learning of a single-index model $y \approx g_*(\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle)$ (a sketch of this model appears after this list). We prove that Mamba, pretrained by gradient-based methods, can achieve efficient ICL via test-time feature learning.
arXiv Detail & Related papers (2025-10-14T00:21:20Z) - Can Mamba Learn In Context with Outliers? A Theoretical Generalization Analysis [88.05636819649804]
The Mamba model has gained significant attention for its computational advantages over Transformer-based models. This paper presents the first theoretical analysis of the training dynamics of a one-layer Mamba model. We show that although Mamba may require more training to converge, it maintains accurate predictions even when the proportion of outliers exceeds the threshold that a linear Transformer can tolerate.
arXiv Detail & Related papers (2025-10-01T01:25:01Z) - Differential Mamba [16.613266337054267]
Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications.
arXiv Detail & Related papers (2025-07-08T17:30:14Z) - Probing In-Context Learning: Impact of Task Complexity and Model Architecture on Generalization and Efficiency [10.942999793311765]
We investigate in-context learning (ICL) through a meticulous experimental framework that systematically varies task complexity and model architecture. We evaluate four distinct models: a GPT2-style Transformer, a Transformer with the FlashAttention mechanism, a convolutional Hyena-based model, and the Mamba state-space model.
arXiv Detail & Related papers (2025-05-10T00:22:40Z) - TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba [66.80624029365448]
We propose a cross-architecture knowledge transfer paradigm, TransMamba, that facilitates the reuse of Transformer pre-trained knowledge. We propose a two-stage framework to accelerate the training of Mamba-based models, ensuring their effectiveness across both uni-modal and multi-modal tasks.
arXiv Detail & Related papers (2025-02-21T01:22:01Z) - From Markov to Laplace: How Mamba In-Context Learns Markov Chains [36.22373318908893]
We study in-context learning on Markov chains and uncover a surprising phenomenon. Unlike transformers, even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator (a sketch of this estimator appears after this list). These theoretical insights align strongly with empirical results and represent the first formal connection between Mamba and optimal statistical estimators.
arXiv Detail & Related papers (2025-02-14T14:13:55Z) - Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show a fast flow on the regression loss despite the non-convexity of the optimization landscape.
This is the first theoretical analysis of multi-layer Transformers in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - How Do Nonlinear Transformers Learn and Generalize in In-Context Learning? [82.51626700527837]
Transformer-based large language models have displayed impressive in-context learning capabilities, where a pre-trained model can handle new tasks without fine-tuning.
We analyze the mechanics of how Transformers achieve ICL, addressing the technical challenges posed by the training problem in Transformers.
arXiv Detail & Related papers (2024-02-23T21:07:20Z) - Is Mamba Capable of In-Context Learning? [63.682741783013306]
State-of-the-art foundation models such as GPT-4 perform surprisingly well at in-context learning (ICL).
This work provides empirical evidence that Mamba, a newly proposed state space model, has similar ICL capabilities.
arXiv Detail & Related papers (2024-02-05T16:39:12Z)
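For the "Mamba Can Learn Low-Dimensional Targets In-Context" entry above, here is a minimal sketch of sampling from the single-index model $y \approx g_*(\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle)$; the link function g_*, noise level, and dimensions are illustrative assumptions, not taken from that paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 5

# Unit index direction beta: the target depends on x only through one direction.
beta = rng.normal(size=d)
beta /= np.linalg.norm(beta)

g_star = np.tanh  # illustrative choice of the unknown link function g_*

# Each label depends on x only through the 1-D projection <beta, x>.
X = rng.normal(size=(n, d))
y = g_star(X @ beta) + 0.01 * rng.normal(size=n)  # small additive noise (assumed)
print(np.round(y, 3))
```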
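For the "From Markov to Laplace" entry above, a minimal sketch of the add-beta (Laplacian) smoothing estimator for Markov chain transition probabilities; the smoothing constant beta = 1 (add-one smoothing), state count, and toy sequence are illustrative assumptions.

```python
import numpy as np

def laplacian_smoothing_transitions(seq, k, beta=1.0):
    """Add-beta (Laplacian) smoothed estimate of P(j | i) for a k-state
    Markov chain observed as the sequence `seq`."""
    counts = np.zeros((k, k))
    for i, j in zip(seq[:-1], seq[1:]):
        counts[i, j] += 1
    # (count(i, j) + beta) / (count(i, .) + beta * k): rows are never zero
    # and each row sums to 1, even for unseen states.
    return (counts + beta) / (counts.sum(axis=1, keepdims=True) + beta * k)

seq = [0, 1, 1, 0, 1, 0, 0, 1]  # toy binary-state observation sequence
print(laplacian_smoothing_transitions(seq, k=2))
```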