Memory Determines Learning Direction: A Theory of Gradient-Based Optimization in State Space Models
- URL: http://arxiv.org/abs/2510.00563v1
- Date: Wed, 01 Oct 2025 06:30:42 GMT
- Title: Memory Determines Learning Direction: A Theory of Gradient-Based Optimization in State Space Models
- Authors: JingChuan Guan, Tomoyuki Kubota, Yasuo Kuniyoshi, Kohei Nakajima,
- Abstract summary: State space models (SSMs) have gained attention by showing potential to outperform Transformers.<n>In this study, we provide such an explanation and propose an improved training strategy.
- Score: 2.6599014990168834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State space models (SSMs) have gained attention by showing potential to outperform Transformers. However, previous studies have not sufficiently addressed the mechanisms underlying their high performance owing to a lack of theoretical explanation of SSMs' learning dynamics. In this study, we provide such an explanation and propose an improved training strategy. The memory capacity of SSMs can be evaluated by examining how input time series are stored in their current state. Such an examination reveals a tradeoff between memory accuracy and length, as well as the theoretical equivalence between the structured state space sequence model (S4) and a simplified S4 with diagonal recurrent weights. This theoretical foundation allows us to elucidate the learning dynamics, proving the importance of initial parameters. Our analytical results suggest that successful learning requires the initial memory structure to be the longest possible even if memory accuracy may deteriorate or the gradient lose the teacher information. Experiments on tasks requiring long memory confirmed that extending memory is difficult, emphasizing the importance of initialization. Furthermore, we found that fixing recurrent weights can be more advantageous than adapting them because it achieves comparable or even higher performance with faster convergence. Our results provide a new theoretical foundation for SSMs and potentially offer a novel optimization strategy.
Related papers
- The Curious Case of In-Training Compression of State Space Models [49.819321766705514]
State Space Models (SSMs) tackle long sequence modeling tasks efficiently, offer both parallelizable training and fast inference.<n>Key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden.<n>Our approach, textscCompreSSM, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models.
arXiv Detail & Related papers (2025-10-03T09:02:33Z) - Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling [60.63703438729223]
We show how different architectures and training methods affect model multi-step reasoning capabilities.<n>We confirm that increasing model depth plays a crucial role for sequential computations.
arXiv Detail & Related papers (2025-08-22T18:57:08Z) - Emergence of the Primacy Effect in Structured State-Space Models [0.35534933448684125]
Structured state-space models (SSMs) have been developed to offer more persistent memory retention than traditional recurrent neural networks.<n>The memory mechanism of canonical SSMs is theoretically designed to decay monotonically over time.<n>The present study reveals a counterintuitive finding: when trained and evaluated on a synthetic, statistically balanced memorization task, SSMs predominantly preserve the *initially* presented data in memory.
arXiv Detail & Related papers (2025-02-19T13:55:32Z) - Forget Forgetting: Continual Learning in a World of Abundant Memory [55.64184779530581]
Continual learning has traditionally focused on minimizing exemplar memory.<n>This paper challenges this paradigm by investigating a more realistic regime.<n>We find that the core challenge shifts from stability to plasticity, as models become biased toward prior tasks and struggle to learn new ones.
arXiv Detail & Related papers (2025-02-11T05:40:52Z) - What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning [64.93848182403116]
Current deep-learning memory models struggle in reinforcement learning environments that are partially observable and long-term.
We introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents.
Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks.
arXiv Detail & Related papers (2024-10-14T03:50:17Z) - Mathematical Formalism for Memory Compression in Selective State Space Models [0.0]
State space models (SSMs) have emerged as a powerful framework for modelling long-range dependencies in sequence data.
We develop a rigorous mathematical framework for understanding memory compression in selective state space models.
We show that selective SSMs offer significant improvements in memory efficiency and processing speed compared to traditional RNN-based models.
arXiv Detail & Related papers (2024-10-04T05:45:48Z) - Fine-Grained Gradient Restriction: A Simple Approach for Mitigating Catastrophic Forgetting [41.891312602770746]
Gradient Episodic Memory (GEM) achieves balance by utilizing a subset of past training samples to restrict the update direction of the model parameters.
We show that memory strength is effective mainly because it improves GEM's ability generalization and therefore leads to a more favorable trade-off.
arXiv Detail & Related papers (2024-10-01T17:03:56Z) - HOPE for a Robust Parameterization of Long-memory State Space Models [51.66430224089725]
State-space models (SSMs) that utilize linear, time-invariant (LTI) systems are known for their effectiveness in learning long sequences.
We develop a new parameterization scheme, called HOPE, for LTI systems that utilize Markov parameters within Hankel operators.
Our new parameterization endows the SSM with non-decaying memory within a fixed time window, which is empirically corroborated by a sequential CIFAR-10 task with padded noise.
arXiv Detail & Related papers (2024-05-22T20:20:14Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - Sequential Memory with Temporal Predictive Coding [6.228559238589584]
We propose a PC-based model for emphsequential memory, called emphtemporal predictive coding (tPC)
We show that our tPC models can memorize and retrieve sequential inputs accurately with a biologically plausible neural implementation.
arXiv Detail & Related papers (2023-05-19T20:03:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.