State-space models can learn in-context by gradient descent
- URL: http://arxiv.org/abs/2410.11687v1
- Date: Tue, 15 Oct 2024 15:22:38 GMT
- Title: State-space models can learn in-context by gradient descent
- Authors: Neeraj Mohan Sushma, Yudou Tian, Harshvardhan Mestha, Nicolo Colombo, David Kappel, Anand Subramoney
- Abstract summary: This study demonstrates that state-space model architectures can perform gradient-based learning and use it for in-context learning.
We prove that a single structured state-space model layer, augmented with local self-attention, can reproduce the outputs of an implicit linear model with least squares loss after one step of gradient descent.
The theoretical construction elucidates the role of local self-attention and multiplicative interactions in recurrent architectures as the key ingredients for enabling the expressive power typical of foundation models.
- Score: 1.3087858009942543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep state-space models (Deep SSMs) have shown capabilities for in-context learning on autoregressive tasks, similar to transformers. However, the architectural requirements and mechanisms enabling this in recurrent networks remain unclear. This study demonstrates that state-space model architectures can perform gradient-based learning and use it for in-context learning. We prove that a single structured state-space model layer, augmented with local self-attention, can reproduce the outputs of an implicit linear model with least squares loss after one step of gradient descent. Our key insight is that the diagonal linear recurrent layer can act as a gradient accumulator, which can be "applied" to the parameters of the implicit regression model. We validate our construction by training randomly initialized augmented SSMs on simple linear regression tasks. The empirically optimized parameters match the theoretical ones, obtained analytically from the implicit model construction. Extensions to multi-step linear and non-linear regression yield consistent results. The constructed SSM encompasses features of modern deep state-space models, with the potential for scalable training and effectiveness even in general tasks. The theoretical construction elucidates the role of local self-attention and multiplicative interactions in recurrent architectures as the key ingredients for enabling the expressive power typical of foundation models.
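The mechanism described in the abstract can be illustrated numerically: for a least-squares loss, the gradient of the implicit linear model is a sum of per-token terms, and a diagonal linear recurrence with unit decay accumulates exactly such a sum across the context. The NumPy sketch below is a minimal illustration under simplifying assumptions (implicit weights initialised at zero, an arbitrary step size `eta`, small illustrative dimensions); it is not the paper's construction or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# In-context linear regression: context pairs (x_i, y_i) generated by a
# ground-truth linear map, followed by a query x_q to be predicted.
d, n_ctx, eta = 4, 16, 0.1           # input dim, context length, step size (all illustrative)
W_true = rng.normal(size=(1, d))
X = rng.normal(size=(n_ctx, d))
Y = X @ W_true.T                     # (n_ctx, 1) targets for the context tokens

# Implicit linear model, initialised at W0 = 0, updated by ONE step of
# gradient descent on the squared-error loss summed over the context.
W0 = np.zeros((1, d))
grad = np.zeros((1, d))
for x, y in zip(X, Y):
    err = W0 @ x - y                      # residual of the implicit model, shape (1,)
    grad += err[:, None] * x[None, :]     # per-token outer-product term
W1 = W0 - eta * grad

# The same gradient is a running sum over tokens: exactly what a diagonal
# linear recurrence h_t = a * h_{t-1} + u_t computes when a = 1 and u_t is
# the (multiplicatively formed) per-token term err_t * x_t.
h = np.zeros((1, d))
for x, y in zip(X, Y):
    u_t = (W0 @ x - y)[:, None] * x[None, :]
    h = 1.0 * h + u_t                     # the recurrence acts as a gradient accumulator
assert np.allclose(h, grad)

# Prediction for a query token after the implicit one-step update.
x_q = rng.normal(size=d)
print("one-step GD prediction:", (W1 @ x_q).item())
print("ground truth:          ", (W_true @ x_q).item())
```

Setting the recurrence coefficient to one turns the diagonal recurrence into a pure accumulator; the per-token term err_t * x_t requires a multiplicative interaction between inputs, which in the abstract is attributed to the local self-attention, matching its emphasis on multiplicative interactions.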
Related papers
- Exact Learning Dynamics of In-Context Learning in Linear Transformers and Its Application to Non-Linear Transformers [1.7034813545878589]
Transformer models exhibit remarkable in-context learning (ICL) abilities.
Our work offers an exact dynamical model for ICL and theoretically grounded tools for analyzing complex transformer training.
arXiv Detail & Related papers (2025-04-17T13:05:33Z) - In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data.
Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z) - Re-examining learning linear functions in context [1.8843687952462742]
In-context learning (ICL) has emerged as a powerful paradigm for easily adapting Large Language Models (LLMs) to various tasks.
We explore a simple model of ICL in a controlled setup with synthetic training data.
Our findings challenge the prevailing narrative that transformers adopt algorithmic approaches to learn a linear function in-context.
arXiv Detail & Related papers (2024-11-18T10:58:46Z) - A domain decomposition-based autoregressive deep learning model for unsteady and nonlinear partial differential equations [2.7755345520127936]
We propose a domain-decomposition-based deep learning (DL) framework, named CoMLSim, for accurately modeling unsteady and nonlinear partial differential equations (PDEs).
The framework consists of two key components: (a) a convolutional neural network (CNN)-based autoencoder architecture and (b) an autoregressive model composed of fully connected layers.
arXiv Detail & Related papers (2024-08-26T17:50:47Z) - eXponential FAmily Dynamical Systems (XFADS): Large-scale nonlinear Gaussian state-space modeling [9.52474299688276]
We introduce a low-rank structured variational autoencoder framework for nonlinear state-space graphical models.
We show that our approach consistently demonstrates the ability to learn a more predictive generative model.
arXiv Detail & Related papers (2024-03-03T02:19:49Z) - Theoretical Foundations of Deep Selective State-Space Models [13.971499161967083]
Deep SSMs demonstrate outstanding performance across a diverse set of domains.
Recent developments allow the linear recurrence powering SSMs to include multiplicative interactions between inputs and hidden states.
We show that when random linear recurrences are equipped with simple input-controlled transitions, the hidden state is provably a low-dimensional projection of a powerful mathematical object.
arXiv Detail & Related papers (2024-02-29T11:20:16Z) - SIP: Injecting a Structural Inductive Bias into a Seq2Seq Model by Simulation [75.14793516745374]
We show how a structural inductive bias can be efficiently injected into a seq2seq model by pre-training it to simulate structural transformations on synthetic data.
Our experiments show that our method imparts the desired inductive bias, resulting in better few-shot learning for FST-like tasks.
arXiv Detail & Related papers (2023-10-01T21:19:12Z) - Latent Traversals in Generative Models as Potential Flows [113.4232528843775]
We propose to model latent structures with a learned dynamic potential landscape.
Inspired by physics, optimal transport, and neuroscience, these potential landscapes are learned as physically realistic partial differential equations.
Our method achieves trajectories that are both qualitatively and quantitatively more disentangled than those of state-of-the-art baselines.
arXiv Detail & Related papers (2023-04-25T15:53:45Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - Latent Variable Representation for Reinforcement Learning [131.03944557979725]
It remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of model-based reinforcement learning.
We provide a representation view of latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle.
In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models.
arXiv Detail & Related papers (2022-12-17T00:26:31Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass (a minimal numerical sketch of this correspondence appears after this list).
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - What learning algorithm is in-context learning? Investigations with linear models [87.91612418166464]
We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly.
We show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression.
We present preliminary evidence that in-context learners share algorithmic features with these predictors.
arXiv Detail & Related papers (2022-11-28T18:59:51Z) - State-driven Implicit Modeling for Sparsity and Robustness in Neural Networks [3.604879434384177]
We present a new approach to training implicit models, called State-driven Implicit Modeling (SIM).
SIM constrains the internal states and outputs to match that of a baseline model, circumventing costly backward computations.
We demonstrate how the SIM approach can be applied to significantly improve the sparsity and robustness of trained baseline models.
arXiv Detail & Related papers (2022-09-19T23:58:48Z) - Merging Two Cultures: Deep and Statistical Learning [3.15863303008255]
Merging the two cultures of deep and statistical learning provides insights into structured high-dimensional data.
We show that prediction, optimisation, and uncertainty quantification can be achieved using probabilistic methods at the output layer of the model.
arXiv Detail & Related papers (2021-10-22T02:57:21Z) - Sparse Flows: Pruning Continuous-depth Models [107.98191032466544]
We show that pruning improves generalization for neural ODEs in generative modeling.
We also show that pruning finds minimal and efficient neural ODE representations with up to 98% fewer parameters than the original network, without loss of accuracy.
arXiv Detail & Related papers (2021-06-24T01:40:17Z) - DAIS: Automatic Channel Pruning via Differentiable Annealing Indicator
Search [55.164053971213576]
Convolutional neural networks have achieved great success in computer vision tasks, despite large computation overhead.
Structured (channel) pruning is usually applied to reduce the model redundancy while preserving the network structure.
Existing structured pruning methods require hand-crafted rules, which may lead to a tremendous pruning space.
arXiv Detail & Related papers (2020-11-04T07:43:01Z) - Learning Variational Data Assimilation Models and Solvers [34.22350850350653]
We introduce end-to-end neural network architectures for data assimilation.
A key feature of the proposed end-to-end learning architecture is that we may train the NN models using both supervised and unsupervised strategies.
arXiv Detail & Related papers (2020-07-25T14:28:48Z) - Causality-aware counterfactual confounding adjustment for feature
representations learned by deep models [14.554818659491644]
Causal modeling has been recognized as a potential solution to many challenging problems in machine learning (ML).
We describe how a recently proposed counterfactual approach can still be used to deconfound the feature representations learned by deep neural network (DNN) models.
arXiv Detail & Related papers (2020-04-20T17:37:36Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate the transition from the kernel regime to the rich regime empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
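As noted in the "Transformers learn in-context by gradient descent" entry above, the correspondence between a forward pass and a gradient step can be checked numerically. The sketch below is a simplified illustration under stated assumptions (implicit weights initialised at zero, unnormalised linear attention without softmax, illustrative sizes and step size `eta`); it is not the cited paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# With the implicit linear model initialised at W0 = 0, the one-step GD
# prediction for a query equals an (unnormalised) linear-attention readout
# with keys x_i, values y_i and query x_q.
d, n_ctx, eta = 4, 32, 0.05          # illustrative sizes and step size
W_true = rng.normal(size=(1, d))
X = rng.normal(size=(n_ctx, d))
Y = X @ W_true.T                     # (n_ctx, 1) context targets
x_q = rng.normal(size=d)

# One explicit gradient-descent step on the squared-error loss, from W0 = 0.
grad = np.zeros((1, d))
for x, y in zip(X, Y):
    grad += (0.0 - y)[:, None] * x[None, :]   # residual times input, per token
W1 = -eta * grad
gd_prediction = (W1 @ x_q).item()

# Linear attention: scores are plain dot products (no softmax), values are
# the context targets; the readout matches the one-step GD prediction.
scores = X @ x_q                      # <x_i, x_q> for each context token
attn_prediction = eta * (scores @ Y[:, 0])

assert np.isclose(gd_prediction, attn_prediction)
print(gd_prediction, float(attn_prediction))
```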
This list is automatically generated from the titles and abstracts of the papers on this site.