The Illusion of State in State-Space Models
- URL: http://arxiv.org/abs/2404.08819v2
- Date: Tue, 4 Jun 2024 22:05:45 GMT
- Title: The Illusion of State in State-Space Models
- Authors: William Merrill, Jackson Petty, Ashish Sabharwal
- Abstract summary: State-space models (SSMs) have emerged as a potential alternative architecture for building large language models.
We show that SSMs have similar limitations to non-recurrent models like transformers, which may limit their ability to solve real-world state-tracking problems.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-space models (SSMs) have emerged as a potential alternative architecture for building large language models (LLMs) compared to the previously ubiquitous transformer architecture. One theoretical weakness of transformers is that they cannot express certain kinds of sequential computation and state tracking (Merrill & Sabharwal, 2023), which SSMs are explicitly designed to address via their close architectural similarity to recurrent neural networks (RNNs). But do SSMs truly have an advantage (over transformers) in expressive power for state tracking? Surprisingly, the answer is no. Our analysis reveals that the expressive power of SSMs is limited very similarly to transformers: SSMs cannot express computation outside the complexity class $\mathsf{TC}^0$. In particular, this means they cannot solve simple state-tracking problems like permutation composition. It follows that SSMs are provably unable to accurately track chess moves with certain notation, evaluate code, or track entities in a long narrative. To supplement our formal analysis, we report experiments showing that Mamba-style SSMs indeed struggle with state tracking. Thus, despite its recurrent formulation, the "state" in an SSM is an illusion: SSMs have similar expressiveness limitations to non-recurrent models like transformers, which may fundamentally limit their ability to solve real-world state-tracking problems.
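To make the permutation-composition task concrete, the sketch below (an illustrative example, not code from the paper; all names are hypothetical) generates instances of the word problem for the symmetric group S_5: the input is a sequence of permutations of five elements and the label is their composition. Composing permutations of five elements is an NC^1-complete state-tracking problem, so under the paper's $\mathsf{TC}^0$ bound a fixed-depth SSM is not expected to solve it at arbitrary sequence lengths.
```python
# Minimal sketch of the S_5 permutation-composition (word problem) task used
# to probe state tracking. Hypothetical illustration, assuming the task is
# framed as: read a sequence of permutation ids, predict the id of their
# composition.
import itertools
import random

ELEMENTS = 5
PERMS = list(itertools.permutations(range(ELEMENTS)))  # all 120 elements of S_5

def compose(p, q):
    """Apply permutation p, then q (tuples mapping index -> image)."""
    return tuple(q[p[i]] for i in range(len(p)))

def make_example(seq_len, rng):
    """Return a sequence of permutation ids and the id of their composition."""
    ids = [rng.randrange(len(PERMS)) for _ in range(seq_len)]
    state = tuple(range(ELEMENTS))  # start from the identity permutation
    for i in ids:
        state = compose(state, PERMS[i])
    return ids, PERMS.index(state)

rng = random.Random(0)
inputs, target = make_example(seq_len=16, rng=rng)
print(inputs, "->", target)
```
A state-tracking experiment in the spirit of the paper's would train a Mamba-style model on such (inputs, target) pairs and test whether accuracy holds as the sequence length grows.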
Related papers
- Provable Benefits of Complex Parameterizations for Structured State Space Models [51.90574950170374]
Structured state space models (SSMs) are linear dynamical systems adhering to a specified structure.
In contrast to typical neural network modules, whose parameterizations are real, SSMs often use complex parameterizations.
This paper takes a step towards explaining the benefits of complex parameterizations for SSMs by establishing formal gaps between real and complex diagonal SSMs.
arXiv Detail & Related papers (2024-10-17T22:35:50Z) - On the Adversarial Transferability of Generalized "Skip Connections" [83.71752155227888]
Skip connections are an essential ingredient for making modern deep models deeper and more powerful.
We find that using more gradients from the skip connections rather than the residual modules during backpropagation allows one to craft adversarial examples with high transferability.
We conduct comprehensive transfer attacks against various models including ResNets, Transformers, Inceptions, Neural Architecture Search, and Large Language Models.
arXiv Detail & Related papers (2024-10-11T16:17:47Z) - Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality [31.985243136674146]
State-space models (SSMs) such as Mamba have been shown to match or outperform Transformers at small to medium scale.
Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster.
arXiv Detail & Related papers (2024-05-31T17:50:01Z) - State Space Models are Comparable to Transformers in Estimating Functions with Dynamic Smoothness [41.617269918948686]
Deep neural networks based on state space models (SSMs) are attracting much attention in sequence modeling.
This paper theoretically explores in which tasks SSMs can be alternatives to Transformers from the perspective of estimating sequence-to-sequence functions.
We prove that SSMs can estimate the target function as well as Transformers can, even when the smoothness changes depending on the input sequence.
arXiv Detail & Related papers (2024-05-29T12:23:48Z) - The Expressive Capacity of State Space Models: A Formal Language Perspective [0.8948475969696075]
Recurrent models based on linear state space models (SSMs) have shown promising performance in language modeling (LM), competitive with transformers.
We present a comprehensive theoretical study of the capacity of such SSMs as it compares to that of transformers and traditional RNNs.
arXiv Detail & Related papers (2024-05-27T17:46:57Z) - Meanings and Feelings of Large Language Models: Observability of Latent States in Generative AI [65.04274914674771]
We show that current Large Language Models (LLMs) cannot have 'feelings' according to the American Psychological Association (APA) definition.
Our analysis sheds light on possible designs that would enable a model to perform non-trivial computation that is not visible to the user.
arXiv Detail & Related papers (2024-05-22T23:18:58Z) - Theoretical Foundations of Deep Selective State-Space Models [13.971499161967083]
Deep SSMs demonstrate outstanding performance across a diverse set of domains.
Recent developments show benefits when the linear recurrence powering SSMs allows for multiplicative interactions between inputs and hidden states.
We show that when random linear recurrences are equipped with simple input-controlled transitions, the hidden state is provably a low-dimensional projection of a powerful mathematical object.
arXiv Detail & Related papers (2024-02-29T11:20:16Z) - Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z) - Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTMs.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z) - Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers [50.85524803885483]
This work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical learnability.
We study SM approximation for two function classes: circuits and Turing machines.
arXiv Detail & Related papers (2021-07-28T04:28:55Z)