Related papers: Theoretical Foundations of Deep Selective State-Space Models

URL: http://arxiv.org/abs/2402.19047v3
Date: Fri, 01 Nov 2024 13:28:59 GMT
Title: Theoretical Foundations of Deep Selective State-Space Models
Authors: Nicola Muca Cirone, Antonio Orvieto, Benjamin Walker, Cristopher Salvi, Terry Lyons
Abstract summary: Deep SSMs demonstrate outstanding performance across a diverse set of domains. Recent developments show that if the linear recurrence powering SSMs allows for multiplicative interactions between inputs and hidden states, then the resulting architecture can surpass attention-powered foundation models in both accuracy and efficiency. We show that when random linear recurrences are equipped with simple input-controlled transitions, the hidden state is provably a low-dimensional projection of a powerful mathematical object, the signature of the input.
Score: 13.971499161967083
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Structured state-space models (SSMs) such as S4, stemming from the seminal work of Gu et al., are gaining popularity as effective approaches for modeling sequential data. Deep SSMs demonstrate outstanding performance across a diverse set of domains, at a reduced training and inference cost compared to attention-based transformers. Recent developments show that if the linear recurrence powering SSMs allows for multiplicative interactions between inputs and hidden states (e.g. GateLoop, Mamba, GLA), then the resulting architecture can surpass attention-powered foundation models trained on text in both accuracy and efficiency, at scales of billions of parameters. In this paper, we give theoretical grounding to this recent finding using tools from Rough Path Theory: we show that when random linear recurrences are equipped with simple input-controlled transitions (selectivity mechanism), then the hidden state is provably a low-dimensional projection of a powerful mathematical object called the signature of the input -- capturing non-linear interactions between tokens at distinct timescales. Our theory not only motivates the success of modern selective state-space models such as Mamba but also provides a solid framework to understand the expressive power of future SSM variants.
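Illustration: the abstract's "random linear recurrences equipped with simple input-controlled transitions" can be pictured with a small NumPy sketch. This is a hedged toy, not the paper's construction: the function name `selective_linear_recurrence`, the Euler-style update on input increments, the Gaussian matrices with 1/sqrt(hidden_dim) scaling, and the hidden size are all assumptions made for illustration.

```python
import numpy as np


def selective_linear_recurrence(x, hidden_dim=16, seed=0):
    """Toy input-controlled linear recurrence (illustrative assumption).

    Update rule: h_t = (I + sum_i dx_t[i] * A_i) @ h_{t-1},
    where dx_t = x_t - x_{t-1} and the A_i are fixed random matrices.

    x: array of shape (T, d), a sequence of d-dimensional inputs.
    Returns the final hidden state, shape (hidden_dim,).
    """
    rng = np.random.default_rng(seed)
    _, d = x.shape
    # One random transition matrix per input channel (assumed scaling).
    A = rng.normal(size=(d, hidden_dim, hidden_dim)) / np.sqrt(hidden_dim)
    h = rng.normal(size=hidden_dim)  # random initial hidden state
    for dx in np.diff(x, axis=0):
        # The transition matrix depends on the input increment, so the
        # input multiplies the hidden state instead of only being added.
        transition = np.eye(hidden_dim) + np.einsum("i,ijk->jk", dx, A)
        h = transition @ h
    return h


# Two paths with the same endpoints but the two steps taken in opposite order.
path_a = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
path_b = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
h_a = selective_linear_recurrence(path_a)
h_b = selective_linear_recurrence(path_b)
```

Because the random matrices generally do not commute, `h_a` and `h_b` differ even though both paths share the same start and end points. Order sensitivity of this kind is what signature-type features encode; the sketch itself does not verify the paper's projection result.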

Related papers

Scalable Language Models with Posterior Inference of Latent Thought Vectors [52.63299874322121]
Latent-Thought Language Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space. LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space. LTMs significantly outperform conventional autoregressive models and discrete diffusion models in validation perplexity and zero-shot language modeling.
arXiv Detail & Related papers (2025-02-03T17:50:34Z)
Recursive Learning of Asymptotic Variational Objectives [49.69399307452126]
General state-space models (SSMs) are widely used in statistical machine learning and are among the most classical generative models for sequential time-series data. Online sequential IWAE (OSIWAE) allows for online learning of both model parameters and a Markovian recognition model for inferring latent states. This approach is more theoretically well-founded than recently proposed online variational SMC methods.
arXiv Detail & Related papers (2024-11-04T16:12:37Z)
START: A Generalized State Space Model with Saliency-Driven Token-Aware Transformation [27.301312891532277]
Domain Generalization (DG) aims to enable models to generalize to unseen target domains by learning from multiple source domains. We propose START, which achieves state-of-the-art (SOTA) performances and offers a competitive alternative to CNNs and ViTs. Our START can selectively perturb and suppress domain-specific features in salient tokens within the input-dependent matrices of SSMs, thus effectively reducing the discrepancy between different domains.
arXiv Detail & Related papers (2024-10-21T13:50:32Z)
State-space models can learn in-context by gradient descent [1.3087858009942543]
We show that state-space models can perform gradient-based learning and use it for in-context learning in much the same way as transformers. Specifically, we prove that a single structured state-space model layer, augmented with multiplicative input and output gating, can reproduce the outputs of an implicit linear model. We also provide novel insights into the relationship between state-space models and linear self-attention, and their ability to learn in-context.
arXiv Detail & Related papers (2024-10-15T15:22:38Z)
Mathematical Formalism for Memory Compression in Selective State Space Models [0.0]
State space models (SSMs) have emerged as a powerful framework for modelling long-range dependencies in sequence data. We develop a rigorous mathematical framework for understanding memory compression in selective state space models. We show that selective SSMs offer significant improvements in memory efficiency and processing speed compared to traditional RNN-based models.
arXiv Detail & Related papers (2024-10-04T05:45:48Z)
Decision Mamba: A Multi-Grained State Space Model with Self-Evolution Regularization for Offline RL [57.202733701029594]
Decision Mamba is a novel multi-grained state space model with a self-evolving policy learning strategy. To mitigate the overfitting issue on noisy trajectories, a self-evolving policy is proposed by using progressive regularization. The policy evolves by using its own past knowledge to refine the suboptimal actions, thus enhancing its robustness on noisy demonstrations.
arXiv Detail & Related papers (2024-06-08T10:12:00Z)
The Expressive Capacity of State Space Models: A Formal Language Perspective [0.8948475969696075]
Recurrent models based on linear state space models (SSMs) have shown promising performance in language modeling (LM), competitive with transformers. We present a comprehensive theoretical study of the capacity of such SSMs as it compares to that of transformers and traditional RNNs.
arXiv Detail & Related papers (2024-05-27T17:46:57Z)
Spatiotemporal Implicit Neural Representation as a Generalized Traffic Data Learner [46.866240648471894]
Spatiotemporal Traffic Data (STTD) measures the complex dynamical behaviors of the multiscale transportation system. We present a novel paradigm to address the STTD learning problem by parameterizing STTD as an implicit neural representation. We validate its effectiveness through extensive experiments in real-world scenarios, showcasing applications from corridor to network scales.
arXiv Detail & Related papers (2024-05-06T06:23:06Z)
Synthetic location trajectory generation using categorical diffusion models [50.809683239937584]
Diffusion models (DPMs) have rapidly evolved to be one of the predominant generative models for the simulation of synthetic data. We propose using DPMs for the generation of synthetic individual location trajectories (ILTs) which are sequences of variables representing physical locations visited by individuals.
arXiv Detail & Related papers (2024-02-19T15:57:39Z)
State space models can express n-gram languages [51.823427608117626]
We build state space language models that can solve the next-word prediction task for languages generated from n-gram rules. Our proof shows how SSMs can encode n-gram rules using new theoretical results on their capacity. We conduct experiments with a small dataset generated from n-gram rules to show how our framework can be applied to SSMs and RNNs obtained through gradient-based optimization.
arXiv Detail & Related papers (2023-06-20T10:41:23Z)
Sparse Modular Activation for Efficient Sequence Modeling [94.11125833685583]
Recent models combining Linear State Space Models with self-attention mechanisms have demonstrated impressive results across a range of sequence modeling tasks. Current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs. We introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely activate sub-modules for sequence elements in a differentiable manner.
arXiv Detail & Related papers (2023-06-19T23:10:02Z)
STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition [50.064502884594376]
We study the problem of human action recognition using motion capture (MoCap) sequences. We propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models.
arXiv Detail & Related papers (2023-03-31T16:19:27Z)
GEM: Group Enhanced Model for Learning Dynamical Control Systems [78.56159072162103]
We build effective dynamical models that are amenable to sample-based learning. We show that learning the dynamics on a Lie algebra vector space is more effective than learning a direct state transition model. This work sheds light on a connection between learning of dynamics and Lie group properties, which opens doors for new research directions.
arXiv Detail & Related papers (2021-04-07T01:08:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.