Related papers: Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing

URL: http://arxiv.org/abs/2501.00658v1
Date: Tue, 31 Dec 2024 22:06:39 GMT
Title: Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing
Authors: Peihao Wang, Ruisi Cai, Yuehao Wang, Jiajun Zhu, Pragya Srivastava, Zhangyang Wang, Pan Li
Abstract summary: We show that Structured State Space Models (SSMs) are inherently limited by strong recency bias. This bias impairs the models' ability to recall distant information and introduces robustness issues. We propose to polarize two channels of the state transition matrices in SSMs, setting them to zero and one, respectively, simultaneously addressing recency bias and over-smoothing.
Score: 56.66469232740998
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Structured State Space Models (SSMs) have emerged as alternatives to transformers. While SSMs are often regarded as effective in capturing long-sequence dependencies, we rigorously demonstrate that they are inherently limited by strong recency bias. Our empirical studies also reveal that this bias impairs the models' ability to recall distant information and introduces robustness issues. Our scaling experiments then discovered that deeper structures in SSMs can facilitate the learning of long contexts. However, subsequent theoretical analysis reveals that as SSMs increase in depth, they exhibit another inevitable tendency toward over-smoothing, e.g., token representations becoming increasingly indistinguishable. This fundamental dilemma between recency and over-smoothing hinders the scalability of existing SSMs. Inspired by our theoretical findings, we propose to polarize two channels of the state transition matrices in SSMs, setting them to zero and one, respectively, simultaneously addressing recency bias and over-smoothing. Experiments demonstrate that our polarization technique consistently enhances the associative recall accuracy of long-range tokens and unlocks SSMs to benefit further from deeper architectures. All source codes are released at https://github.com/VITA-Group/SSM-Bottleneck.
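
The polarization idea in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' released code: it assumes a diagonal SSM with a fixed scalar decay per state channel, h_t = a * h_{t-1} + x_t, whereas the models studied in the paper use richer, input-dependent transitions. The function names and the choice of which two channels to pin (first and last) are hypothetical. Under this assumption, the weight of the first token in the final state is a ** (T - 1), which shrinks quickly for any a < 1 (recency bias); pinning one channel to 1 preserves distant tokens, while pinning another to 0 keeps only the current token and so avoids averaging over the past.

```python
def diagonal_ssm_scan(inputs, decay):
    """Run h_t = a * h_{t-1} + x_t independently for each state channel.

    `inputs` is a sequence of scalars; `decay` holds one transition value
    per channel (an assumed, simplified diagonal parameterization).
    """
    state = [0.0] * len(decay)
    states = []
    for x in inputs:
        state = [a * h + x for a, h in zip(decay, state)]
        states.append(list(state))
    return states


def polarize(decay):
    """Return a copy of `decay` with two channels pinned to 1 and 0.

    Pinning the first and last channel is an illustrative assumption;
    the abstract only says that two channels are set to zero and one.
    """
    if len(decay) < 2:
        raise ValueError("polarization needs at least two channels")
    polarized = list(decay)
    polarized[0] = 1.0   # never forgets: carries distant tokens forward
    polarized[-1] = 0.0  # never accumulates: keeps only the current token
    return polarized


def first_token_influence(decay, seq_len):
    """Weight of x_1 inside h_T for each channel, i.e. a ** (T - 1)."""
    return [a ** (seq_len - 1) for a in decay]


if __name__ == "__main__":
    decay = [0.9, 0.5, 0.8]
    print("plain:    ", first_token_influence(decay, 50))
    print("polarized:", first_token_influence(polarize(decay), 50))
```

In the unpolarized case every channel's weight on the first token decays geometrically with sequence length; after polarization the channel pinned to 1 keeps that weight at exactly 1 regardless of length.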

Related papers

Generalization Error Analysis for Selective State-Space Models Through the Lens of Attention [2.8998926117101367]
State-space models (SSMs) are a new class of foundation models that have emerged as a compelling alternative to Transformers. This paper provides a detailed theoretical analysis of selective SSMs, the core components of the Mamba and Mamba-2 architectures.
arXiv Detail & Related papers (2025-02-03T16:05:31Z)
MVICAD2: Multi-View Independent Component Analysis with Delays and Dilations [61.59658203704757]
We propose Multi-View Independent Component Analysis with Delays and Dilations (MVICAD2), which allows sources to differ across subjects in both temporal delays and dilations. We present a model with identifiable sources, derive an approximation of its likelihood in closed form, and use regularization and optimization techniques to enhance performance.
arXiv Detail & Related papers (2025-01-13T15:47:02Z)
Provable Benefits of Complex Parameterizations for Structured State Space Models [51.90574950170374]
Structured state space models (SSMs) are linear dynamical systems adhering to a specified structure. In contrast to typical neural network modules, whose parameterizations are real, SSMs often use complex parameterizations. This paper takes a step towards explaining the benefits of complex parameterizations for SSMs by establishing formal gaps between real and complex diagonal SSMs.
arXiv Detail & Related papers (2024-10-17T22:35:50Z)
Exploring Adversarial Robustness of Deep State Space Models [26.650751659034782]
Adversarial Training (AT) is a mainstream approach to enhancing Adversarial Robustness (AR). We show that pure SSM structures struggle to benefit from AT, whereas incorporating Attention yields a markedly better trade-off between robustness and generalization. We propose a simple and effective Adaptive Scaling (AdS) mechanism that brings AT performance close to Attention-integrated SSMs without introducing the issue of Robust Overfitting (RO).
arXiv Detail & Related papers (2024-06-08T17:25:48Z)
SMR: State Memory Replay for Long Sequence Modeling [19.755738298836526]
This paper proposes a novel non-recursive non-uniform sample processing strategy to overcome compatibility limitations in parallel convolutional computation. We introduce State Memory Replay (SMR), which utilizes learnable memories to adjust the current state with multi-step information for generalization at sampling points different from those in the training data. Experiments on long-range modeling tasks in autoregressive language modeling and Long Range Arena demonstrate the general effectiveness of the SMR mechanism for a series of SSM models.
arXiv Detail & Related papers (2024-05-27T17:53:32Z)
The Expressive Capacity of State Space Models: A Formal Language Perspective [0.8948475969696075]
Recurrent models based on linear state space models (SSMs) have shown promising performance in language modeling (LM), competitive with transformers. We present a comprehensive theoretical study of the capacity of such SSMs as it compares to that of transformers and traditional RNNs.
arXiv Detail & Related papers (2024-05-27T17:46:57Z)
The Illusion of State in State-Space Models [27.57426601905237]
State-space models (SSMs) have emerged as a potential alternative architecture for building large language models. We show that SSMs have similar limitations to non-recurrent models like transformers, which may limit their ability to solve real-world state-tracking problems.
arXiv Detail & Related papers (2024-04-12T21:30:06Z)
SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding [50.337896542603524]
We introduce SpikeMba: a multi-modal spiking saliency mamba for temporal video grounding. Our approach integrates Spiking Neural Networks (SNNs) with state space models (SSMs) to leverage their unique advantages. Our experiments demonstrate the effectiveness of SpikeMba, which consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2024-04-01T15:26:44Z)
Neuro-mimetic Task-free Unsupervised Online Learning with Continual Self-Organizing Maps [56.827895559823126]
Self-organizing map (SOM) is a neural model often used in clustering and dimensionality reduction. We propose a generalization of the SOM, the continual SOM, which is capable of online unsupervised learning under a low memory budget. Our results, on benchmarks including MNIST, Kuzushiji-MNIST, and Fashion-MNIST, show almost a two times increase in accuracy.
arXiv Detail & Related papers (2024-02-19T19:11:22Z)
An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning [70.48605869773814]
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information. This study empirically evaluates the forgetting phenomenon in large language models during continual instruction tuning.
arXiv Detail & Related papers (2023-08-17T02:53:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.