Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling
- URL: http://arxiv.org/abs/2410.07145v1
- Date: Wed, 9 Oct 2024 17:54:28 GMT
- Title: Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling
- Authors: Yingfa Chen, Xinrong Zhang, Shengding Hu, Xu Han, Zhiyuan Liu, Maosong Sun
- Abstract summary: We study why RNNs fail to process long contexts and suggest critical mitigations.
We first investigate *state collapse* (SC), a phenomenon that causes severe performance degradation on sequence lengths not encountered during training.
We train a series of Mamba-2 models on long documents to empirically estimate the recurrent state capacity in language modeling and passkey retrieval.
- Score: 69.36377985746878
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: One essential advantage of recurrent neural networks (RNNs) over transformer-based language models is their linear computational complexity with respect to sequence length, which makes them much faster at handling long sequences during inference. However, most publicly available RNNs (e.g., Mamba and RWKV) are trained on sequences of fewer than 10K tokens, and their effectiveness in longer contexts remains largely unsatisfactory. In this paper, we study why RNNs fail to process long contexts and suggest critical mitigations. We examine two practical concerns when applying state-of-the-art RNNs to long contexts: (1) the inability to extrapolate to inputs longer than the training length and (2) the upper bound of memory capacity. Addressing the first concern, we investigate *state collapse* (SC), a phenomenon that causes severe performance degradation on sequence lengths not encountered during training. With controlled experiments, we attribute this to overfitting caused by the recurrent state being overparameterized for the training length. For the second concern, we train a series of Mamba-2 models on long documents to empirically estimate the recurrent state capacity in language modeling and passkey retrieval. We then propose three SC mitigation methods to improve Mamba-2's length generalizability, allowing the model to process more than 1M tokens without SC. We also find that the recurrent state capacity in passkey retrieval scales exponentially with the state size, and we train a Mamba-2 370M model that achieves near-perfect passkey retrieval accuracy at a 256K context length. This suggests a promising future for RNN-based long-context modeling.
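To make the abstract's linear-complexity and state-capacity points concrete, below is a minimal NumPy sketch of a gated, fixed-size recurrent state update in the spirit of Mamba-2. The projections and shapes (`W_a`, `W_B`, `W_C`, `d_state`) are illustrative assumptions, not the paper's actual layer or the released Mamba-2 kernel.

```python
import numpy as np

def fixed_state_recurrence(x, W_a, W_B, W_C):
    """Sketch (assumed simplification, not the released Mamba-2 layer) of a
    gated recurrence whose state has a fixed size:

        a_t = sigmoid(W_a x_t)                      # decay gate per state row, in (0, 1)
        H_t = diag(a_t) H_{t-1} + (W_B x_t) x_t^T   # state: (d_state, d_model), constant in T
        y_t = H_t^T (W_C x_t)                       # readout from the state

    The per-step cost and the state size do not grow with sequence length,
    which gives the O(T) inference the abstract mentions; the same fixed
    state size is also what bounds how much context (e.g. a passkey) the
    model can retain -- the "state capacity" studied in the paper.
    """
    d_model = x.shape[1]
    d_state = W_B.shape[0]
    H = np.zeros((d_state, d_model))   # the only memory carried across time steps
    ys = []
    for x_t in x:                      # sequential scan over the T tokens
        a_t = 1.0 / (1.0 + np.exp(-(W_a @ x_t)))         # (d_state,)
        H = a_t[:, None] * H + np.outer(W_B @ x_t, x_t)  # fixed-size state update
        ys.append(H.T @ (W_C @ x_t))                     # (d_model,)
    return np.stack(ys)                # (T, d_model)

# Toy usage: the state stays (d_state, d_model) no matter how long x is.
rng = np.random.default_rng(0)
T, d_model, d_state = 32, 8, 4
x = rng.standard_normal((T, d_model))
y = fixed_state_recurrence(
    x,
    rng.standard_normal((d_state, d_model)),  # W_a
    rng.standard_normal((d_state, d_model)),  # W_B
    rng.standard_normal((d_state, d_model)),  # W_C
)
print(y.shape)  # (32, 8)
```

The sequential loop is written out for readability; production implementations of this kind of recurrence typically compute the same result with hardware-efficient parallel scans.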
Related papers
- Were RNNs All We Needed? [53.393497486332]
We revisit traditional recurrent neural networks (RNNs) from over a decade ago.
We show that by removing the hidden-state dependencies from their input, forget, and update gates, LSTMs and GRUs no longer require backpropagation through time (BPTT) and can be trained efficiently in parallel (see the sketch after the related-papers list below).
arXiv Detail & Related papers (2024-10-02T03:06:49Z) - Learning to (Learn at Test Time): RNNs with Expressive Hidden States [69.78469963604063]
We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state.
Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training layers.
arXiv Detail & Related papers (2024-07-05T16:23:20Z) - On the Computational Complexity and Formal Hierarchy of Second Order Recurrent Neural Networks [59.85314067235965]
We extend the theoretical foundation for second-order recurrent networks (2nd-order RNNs).
We prove that there exists a class of 2nd-order RNNs that is Turing-complete with bounded time.
We also demonstrate that 2nd-order RNNs, without memory, outperform modern models such as vanilla RNNs and gated recurrent units at recognizing regular grammars.
arXiv Detail & Related papers (2023-09-26T06:06:47Z) - INK: Injecting kNN Knowledge in Nearest Neighbor Machine Translation [57.952478914459164]
kNN-MT provides an effective paradigm for smoothing predictions based on neighbor representations during inference.
We propose INK, an effective training framework that directly smooths the representation space by adjusting the representations of kNN neighbors with a small number of new parameters.
Experiments on four benchmark datasets show that the method achieves average gains of 1.99 COMET and 1.0 BLEU, outperforming the state-of-the-art kNN-MT system with 0.02x memory space and a 1.9x inference speedup.
arXiv Detail & Related papers (2023-06-10T08:39:16Z) - SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks [21.616328837090396]
Spiking Neural Networks (SNNs) leverage sparse and event-driven activations to reduce the computational overhead associated with model inference.
We implement a generative language model with binary, event-driven spiking activation units.
SpikeGPT is the largest backpropagation-trained SNN model to date, rendering it suitable for both the generation and comprehension of natural language.
arXiv Detail & Related papers (2023-02-27T16:43:04Z) - Recurrent Neural Networks for Learning Long-term Temporal Dependencies with Reanalysis of Time Scale Representation [16.32068729107421]
We argue that interpreting a forget gate as a temporal representation is valid when the gradient of the loss with respect to the state decreases exponentially going back in time.
We propose an approach to construct new RNNs that can represent a longer time scale than conventional models.
arXiv Detail & Related papers (2021-11-05T06:22:58Z) - Recognizing Long Grammatical Sequences Using Recurrent Networks Augmented With An External Differentiable Stack [73.48927855855219]
Recurrent neural networks (RNNs) are a widely used deep architecture for sequence modeling, generation, and prediction.
RNNs generalize poorly over very long sequences, which limits their applicability to many important temporal processing and time series forecasting problems.
One way to address these shortcomings is to couple an RNN with an external, differentiable memory structure, such as a stack.
In this paper, we improve the memory-augmented RNN with important architectural and state-updating mechanisms.
arXiv Detail & Related papers (2020-04-04T14:19:15Z)
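As referenced in the "Were RNNs All We Needed?" entry above, here is a minimal sketch of a GRU whose gates depend only on the current input (a minGRU-style simplification; the parameter names `W_z`, `W_h` and the plain sequential loop are illustrative, not the authors' released code):

```python
import numpy as np

def min_gru_sketch(x, W_z, W_h):
    """GRU variant whose gates ignore the previous hidden state:

        z_t  = sigmoid(W_z x_t)                  # update gate: input-only
        h~_t = W_h x_t                           # candidate state: input-only
        h_t  = (1 - z_t) * h_{t-1} + z_t * h~_t  # recurrence is linear in h_{t-1}

    Because the coefficients (1 - z_t) and the terms z_t * h~_t depend only
    on the inputs, the whole sequence of h_t can also be computed with a
    parallel prefix scan, so training no longer needs step-by-step
    backpropagation through time. The loop below is the sequential form,
    kept for readability.
    """
    d = W_z.shape[0]
    h = np.zeros(d)
    hs = []
    for x_t in x:
        z = 1.0 / (1.0 + np.exp(-(W_z @ x_t)))   # gate from the input alone
        h_tilde = W_h @ x_t                      # candidate from the input alone
        h = (1.0 - z) * h + z * h_tilde          # linear update of the hidden state
        hs.append(h)
    return np.stack(hs)  # (T, d)
```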
This list is automatically generated from the titles and abstracts of the papers on this site.