StateX: Enhancing RNN Recall via Post-training State Expansion
- URL: http://arxiv.org/abs/2509.22630v1
- Date: Fri, 26 Sep 2025 17:55:22 GMT
- Title: StateX: Enhancing RNN Recall via Post-training State Expansion
- Authors: Xingyu Shen, Yingfa Chen, Zhen Leng Thai, Xu Han, Zhiyuan Liu, Maosong Sun
- Abstract summary: We introduce StateX, a training pipeline for efficiently expanding the states of pre-trained RNNs through post-training. Experiments on models up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning ability of RNNs without incurring high post-training costs or compromising other capabilities.
- Score: 48.96665606047916
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: While Transformer-based models have demonstrated remarkable language modeling performance, their high complexities result in high costs when processing long contexts. In contrast, recurrent neural networks (RNNs) such as linear attention and state space models have gained popularity due to their constant per-token complexities. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a constant-size recurrent state. Previous works have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with larger recurrent states results in high training costs. In this paper, we introduce StateX, a training pipeline for efficiently expanding the states of pre-trained RNNs through post-training. For two popular classes of RNNs, linear attention and state space models, we design post-training architectural modifications to scale up the state size with no or negligible increase in model parameters. Experiments on models up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning ability of RNNs without incurring high post-training costs or compromising other capabilities.
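For intuition, below is a minimal NumPy sketch (not from the paper) of the constant-size recurrent state maintained by unnormalized linear attention, one of the two RNN classes the abstract mentions. The dimensions `d_k`/`d_v`, the sequence length, and the comment about enlarging the state are illustrative assumptions; they do not reproduce the paper's specific StateX architectural modifications.

```python
import numpy as np

def linear_attention_step(S, q, k, v):
    """One recurrent step of unnormalized linear attention.

    S is the (d_k, d_v) recurrent state: its size is fixed no matter how
    long the context is, which is why recall degrades on long inputs.
    """
    S = S + np.outer(k, v)   # write the current token's key/value into the state
    o = S.T @ q              # read the state out against the current query
    return S, o

d_k, d_v, seq_len = 64, 64, 4096
rng = np.random.default_rng(0)
S = np.zeros((d_k, d_v))     # d_k * d_v numbers must summarize the whole context
for _ in range(seq_len):
    q = rng.normal(size=d_k)
    k = rng.normal(size=d_k)
    v = rng.normal(size=d_v)
    S, o = linear_attention_step(S, q, k, v)

# State expansion (illustrative only): enlarging d_k, e.g. from 64 to 128,
# doubles the number of entries in the state while the per-token cost stays
# O(d_k * d_v); StateX aims to achieve this kind of expansion post-training
# with no or negligible increase in model parameters.
```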
Related papers
- Implicit Language Models are RNNs: Balancing Parallelization and Expressivity [4.332158627306896]
State-space models (SSMs) and transformers dominate the language modeling landscape. We propose implicit SSMs, which iterate a transformation until convergence to a fixed point. Our approach demonstrates superior state-tracking capabilities on regular languages, surpassing transformers and SSMs.
arXiv Detail & Related papers (2025-02-10T19:59:31Z)
- GhostRNN: Reducing State Redundancy in RNN with Cheap Operations [66.14054138609355]
We propose an efficient RNN architecture, GhostRNN, which reduces hidden state redundancy with cheap operations.
Experiments on KWS and SE tasks demonstrate that the proposed GhostRNN significantly reduces the memory usage (40%) and computation cost while keeping performance similar.
arXiv Detail & Related papers (2024-11-20T11:37:14Z)
- Stuffed Mamba: Oversized States Lead to the Inability to Forget [53.512358993801115]
We show that Mamba-based models struggle to effectively forget earlier tokens even with built-in forgetting mechanisms. We show that the minimum training length required for the model to learn forgetting scales linearly with the state size, and the maximum context length for accurate retrieval of a 5-digit passkey scales exponentially with the state size. Our work suggests that future RNN designs must account for the interplay between state size, training length, and forgetting mechanisms to achieve robust performance in long-context tasks.
arXiv Detail & Related papers (2024-10-09T17:54:28Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- State-driven Implicit Modeling for Sparsity and Robustness in Neural Networks [3.604879434384177]
We present a new approach to training implicit models, called State-driven Implicit Modeling (SIM).
SIM constrains the internal states and outputs to match those of a baseline model, circumventing costly backward computations.
We demonstrate how the SIM approach can be applied to significantly improve the sparsity and robustness of baseline models.
arXiv Detail & Related papers (2022-09-19T23:58:48Z)
- EGRU: Event-based GRU for activity-sparse inference and learning [0.8260432715157026]
We propose a model that reformulates Gated Recurrent Units (GRU) as an event-based activity-sparse model.
We show that the Event-based GRU (EGRU) demonstrates competitive performance compared to state-of-the-art recurrent network models in real-world tasks.
arXiv Detail & Related papers (2022-06-13T14:07:56Z)
- Least Redundant Gated Recurrent Neural Network [0.0]
We introduce a recurrent neural architecture called Deep Memory Update (DMU).
It is based on updating the previous memory state with a deep transformation of the lagged state and the network input.
Its training is stable and fast because its learning rate is tied to the size of the module.
arXiv Detail & Related papers (2021-05-28T20:24:00Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Recognizing Long Grammatical Sequences Using Recurrent Networks Augmented With An External Differentiable Stack [73.48927855855219]
Recurrent neural networks (RNNs) are a widely used deep architecture for sequence modeling, generation, and prediction.
RNNs generalize poorly over very long sequences, which limits their applicability to many important temporal processing and time series forecasting problems.
One way to address these shortcomings is to couple an RNN with an external, differentiable memory structure, such as a stack.
In this paper, we improve the memory-augmented RNN with important architectural and state updating mechanisms.
arXiv Detail & Related papers (2020-04-04T14:19:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.