Rethinking the long-range dependency in Mamba/SSM and transformer models
- URL: http://arxiv.org/abs/2509.04226v1
- Date: Thu, 04 Sep 2025 13:56:47 GMT
- Title: Rethinking the long-range dependency in Mamba/SSM and transformer models
- Authors: Cong Ma, Kayvan Najarian
- Abstract summary: We mathematically define long-range dependency using the derivative of hidden states with respect to past inputs. We show that the long-range dependency of SSMs decays exponentially with the sequence length, which aligns with the exponential decay of the memory function in RNNs. We propose a new formulation for the hidden state update in SSMs and prove its stability under a standard Gaussian distribution of the input data.
- Score: 4.7663374197637465
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Long-range dependency is one of the most desired properties of recent sequence models such as state-space models (particularly Mamba) and transformer models. New model architectures are being actively developed and benchmarked for prediction tasks requiring long-range dependency. However, the capability of these models to capture long-range dependencies has not been investigated from a theoretical perspective, which hinders systematic improvement on this aspect. In this work, we mathematically define long-range dependency using the derivative of hidden states with respect to past inputs and, based on this definition, compare the capability of SSM and transformer models to model long-range dependency. We show that the long-range dependency of SSMs decays exponentially with the sequence length, which aligns with the exponential decay of the memory function in RNNs. The attention mechanism used in transformers, by contrast, is more flexible and is not constrained to exponential decay, so it could in theory model long-range dependency better given sufficient training data, computing resources, and proper training. To combine the flexible long-range dependency of the attention mechanism with the computational efficiency of SSMs, we propose a new formulation for the hidden state update in SSMs and prove its stability under a standard Gaussian distribution of the input data.
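As a minimal worked illustration of the definition in the abstract, assume a linear time-invariant SSM with state matrix A and input matrix B (a simplification: the paper's Mamba-style update is input-dependent, and its exact formulation may differ). The derivative of the hidden state with respect to a past input is then forced into a geometric envelope, while a single attention layer carries no such constraint:

```latex
% Sketch under simplifying assumptions; A, B, W_V and \alpha_{t,s} are generic
% SSM/attention notation, not symbols taken from the paper.
\begin{align*}
  h_t &= A\,h_{t-1} + B\,x_t
      && \text{(linear time-invariant hidden state update)} \\
  \frac{\partial h_t}{\partial x_s} &= A^{\,t-s} B, \qquad s < t
      && \text{(dependency of the current state on a past input)} \\
  \Bigl\lVert \frac{\partial h_t}{\partial x_s} \Bigr\rVert
      &\le \lVert A \rVert^{\,t-s}\, \lVert B \rVert
      && \text{(exponential decay in } t-s \text{ whenever } \lVert A \rVert < 1\text{)} \\
  y_t &= \sum_{s \le t} \alpha_{t,s}\, W_V\, x_s,
      \qquad \frac{\partial y_t}{\partial x_s} = \alpha_{t,s}\, W_V
      && \text{(attention, weights held fixed: no forced decay in } t-s\text{)}
\end{align*}
```

This is the decay contrast the abstract refers to: the stability of the recurrence (norm of A below one) is exactly what makes the state-to-past-input derivative shrink exponentially, whereas the attention weight \alpha_{t,s} can in principle stay large for arbitrarily distant s.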
Related papers
- MS-SSM: A Multi-Scale State Space Model for Efficient Sequence Modeling [60.648359990090846]
State-space models (SSMs) have recently gained attention as an efficient alternative to computationally expensive attention-based models for sequence modeling. This paper introduces a multi-scale SSM framework that represents sequence dynamics at multiple resolutions and processes each resolution with specialized state-space dynamics.
arXiv Detail & Related papers (2025-12-29T19:36:28Z) - Oscillatory State-Space Models [61.923849241099184]
We propose Linear Oscillatory State-Space models (LinOSS) for efficiently learning on long sequences. A stable discretization, integrated over time using fast associative parallel scans, yields the proposed state-space model. We show that LinOSS is universal, i.e., it can approximate any continuous and causal operator mapping between time-varying functions.
arXiv Detail & Related papers (2024-10-04T22:00:13Z) - Mamba or Transformer for Time Series Forecasting? Mixture of Universals (MoU) Is All You Need [28.301119776877822]
Time series forecasting requires balancing short-term and long-term dependencies for accurate predictions.
Transformers are superior in modeling long-term dependencies but are criticized for their quadratic computational cost.
Mamba provides a near-linear alternative but is reported to be less effective in long-term time series forecasting due to potential information loss.
arXiv Detail & Related papers (2024-08-28T17:59:27Z) - SDE: A Simplified and Disentangled Dependency Encoding Framework for State Space Models in Time Series Forecasting [8.841699904757506]
We identify and formally define three critical dependencies that are fundamental to forecasting accuracy. We propose SDE (Simplified and Disentangled Dependency Encoding), a novel framework designed to enhance the capability of SSMs for time series forecasting.
arXiv Detail & Related papers (2024-08-22T02:14:59Z) - CMamba: Channel Correlation Enhanced State Space Models for Multivariate Time Series Forecasting [18.50360049235537]
Mamba, a state space model, has emerged with robust sequence and feature mixing capabilities.
Capturing cross-channel dependencies is critical to enhancing the performance of time series prediction.
We introduce a refined Mamba variant tailored for time series forecasting.
arXiv Detail & Related papers (2024-06-08T01:32:44Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction into a fixed-length codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps address the lack of long-range dependency.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - Rough Transformers for Continuous and Efficient Time-Series Modelling [46.58170057001437]
Time-series data in real-world medical settings typically exhibit long-range dependencies and are observed at non-uniform intervals.
We introduce the Rough Transformer, a variation of the Transformer model which operates on continuous-time representations of input sequences.
We find that Rough Transformers consistently outperform their vanilla attention counterparts while obtaining the benefits of Neural ODE-based models.
arXiv Detail & Related papers (2024-03-15T13:29:45Z) - FCDNet: Frequency-Guided Complementary Dependency Modeling for Multivariate Time-Series Forecasting [9.083469629116784]
We propose FCDNet, a concise yet effective framework for time-series forecasting.
It helps extract long- and short-term dependency information adaptively from multi-level frequency patterns.
Experiments show that FCDNet significantly exceeds strong baselines.
arXiv Detail & Related papers (2023-12-27T07:29:52Z) - Closed-form Continuous-Depth Models [99.40335716948101]
Continuous-depth neural models rely on advanced numerical differential equation solvers.
We present a new family of models, termed Closed-form Continuous-depth (CfC) networks, that are simple to describe and at least one order of magnitude faster.
arXiv Detail & Related papers (2021-06-25T22:08:51Z) - Transformer Hawkes Process [79.16290557505211]
We propose a Transformer Hawkes Process (THP) model, which leverages the self-attention mechanism to capture long-term dependencies.
THP outperforms existing models in terms of both likelihood and event prediction accuracy by a notable margin.
We provide a concrete example in which THP achieves improved prediction performance for learning multiple point processes when incorporating their relational information.
arXiv Detail & Related papers (2020-02-21T13:48:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.