Understanding and Improving Length Generalization in Recurrent Models
- URL: http://arxiv.org/abs/2507.02782v1
- Date: Thu, 03 Jul 2025 16:45:50 GMT
- Title: Understanding and Improving Length Generalization in Recurrent Models
- Authors: Ricardo Buitrago Ruiz, Albert Gu
- Abstract summary: Recurrent models can process arbitrarily long sequences, but their performance sometimes drops considerably beyond their training context lengths. We show that models fail to length generalize when, during training, they are exposed to only a limited subset of the distribution of all attainable states. We investigate simple training interventions that aim to increase the coverage of the states that the model is trained on.
- Score: 16.642157805072042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, recurrent models such as state space models and linear attention have become popular due to their linear complexity in the sequence length. Thanks to their recurrent nature, in principle they can process arbitrarily long sequences, but their performance sometimes drops considerably beyond their training context lengths, i.e., they fail to length generalize. In this work, we provide comprehensive empirical and theoretical analysis to support the unexplored states hypothesis, which posits that models fail to length generalize when during training they are only exposed to a limited subset of the distribution of all attainable states (i.e., states that would be attained if the recurrence were applied to long sequences). Furthermore, we investigate simple training interventions that aim to increase the coverage of the states that the model is trained on, e.g. by initializing the state with Gaussian noise or with the final state of a different input sequence. With only 500 post-training steps ($\sim 0.1\%$ of the pre-training budget), these interventions enable length generalization for sequences that are orders of magnitude longer than the training context (e.g. $2k\longrightarrow 128k$) and show improved performance in long context tasks, thus presenting a simple and efficient way to enable robust length generalization in general recurrent models.
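The two interventions mentioned in the abstract can be sketched on a generic linear recurrence. This is a minimal illustration only: the recurrence form, state dimensions, and function names below are assumptions for exposition, not the paper's actual architecture or training code.

```python
import numpy as np

def linear_recurrence(u, A, B, h0=None):
    """Run a simple linear recurrence h_t = A @ h_{t-1} + B @ u_t over inputs u.

    By convention the initial state h0 is zeros; per the unexplored states
    hypothesis, training only from this initialization exposes the model to
    a narrow subset of the states attainable on long sequences.
    """
    d_state = A.shape[0]
    h = np.zeros(d_state) if h0 is None else h0
    states = []
    for u_t in u:
        h = A @ h + B @ u_t
        states.append(h.copy())
    return np.stack(states)

def gaussian_state_init(d_state, scale=1.0, rng=None):
    """Intervention sketch 1: initialize the state with Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    return scale * rng.standard_normal(d_state)

def carried_state_init(u_other, A, B):
    """Intervention sketch 2: initialize with the final state of a
    different input sequence, exposing training to deeper states."""
    return linear_recurrence(u_other, A, B)[-1]
```

During post-training, sequences would be processed from these alternative initial states instead of zeros, broadening the state distribution the model sees at the stated cost of only a small number of extra steps.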
Related papers
- Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test [19.213961869113188]
We conduct the first study of grokking on checkpoints during one-pass pretraining of a 7B large language model (LLM), i.e., OLMoE. Our study verifies for the first time that grokking still happens in the pretraining of large-scale foundation models. We develop two novel metrics to quantify pathway distance and the complexity of a single pathway.
arXiv Detail & Related papers (2025-06-26T17:59:58Z) - Sequential-Parallel Duality in Prefix Scannable Models [68.39855814099997]
Recent developments have given rise to various models, such as Gated Linear Attention (GLA) and Mamba. This raises a natural question: can we characterize the full class of neural sequence models that support near-constant-time parallel evaluation and linear-time, constant-space sequential inference?
arXiv Detail & Related papers (2025-06-12T17:32:02Z) - Position as Probability: Self-Supervised Transformers that Think Past Their Training for Length Extrapolation [0.0]
PRISM is a novel positional encoding mechanism that enables Transformers to extrapolate accurately up to 10x beyond their training length. Our analysis demonstrates that PRISM's positional encoding maintains sharp and interpretable internal states, providing a theoretical basis for reliable length generalization.
arXiv Detail & Related papers (2025-06-01T09:20:44Z) - Bigger Isn't Always Memorizing: Early Stopping Overparameterized Diffusion Models [51.03144354630136]
Generalization in natural data domains is progressively achieved during training before the onset of memorization. Generalization vs. memorization is then best understood as a competition between time scales. We show that this phenomenology is recovered in diffusion models learning a simple probabilistic context-free grammar with random rules.
arXiv Detail & Related papers (2025-05-22T17:40:08Z) - The Role of Sparsity for Length Generalization in Transformers [58.65997625433689]
We propose a new theoretical framework to study length generalization for the next-token prediction task. We show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We introduce Predictive Position Coupling, which trains the transformer to predict the position IDs used in a positional coupling approach.
arXiv Detail & Related papers (2025-02-24T03:01:03Z) - Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization [52.16435732772263]
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications.
However, generalization properties of second-order methods are still being debated.
We show for the first time that exact Gauss-Newton (GN) updates take a tractable form in a class of deep architectures.
arXiv Detail & Related papers (2024-11-12T17:58:40Z) - Grokking at the Edge of Linear Separability [1.024113475677323]
We analyze the long-time dynamics of logistic classification on a random feature model with a constant label.
We find that Grokking is amplified when classification is applied to training sets which are on the verge of linear separability.
arXiv Detail & Related papers (2024-10-06T14:08:42Z) - On Provable Length and Compositional Generalization [7.883808173871223]
We provide the first provable guarantees on length and compositional generalization for common sequence-to-sequence models. We show that limited-capacity versions of these different architectures achieve both length and compositional generalization.
arXiv Detail & Related papers (2024-02-07T14:16:28Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - Simple Local Attentions Remain Competitive for Long-Context Tasks [32.785459927278616]
Many NLP tasks require processing long contexts beyond the length limit of pretrained models.
In order to scale these models to longer text sequences, many efficient long-range attention variants have been proposed.
For each attention variant, we pretrain large-size models using the same long-doc corpus and then finetune these models for real-world long-context tasks.
arXiv Detail & Related papers (2021-12-14T07:37:58Z) - Combining Recurrent, Convolutional, and Continuous-time Models with
Linear State-Space Layers [21.09321438439848]
We introduce a simple sequence model inspired by control systems that generalizes the recurrent, convolutional, and continuous-time model families.
We show that LSSL models are closely related to the three aforementioned families of models and inherit their strengths.
For example, they generalize convolutions to continuous-time, explain common RNN heuristics, and share features of NDEs such as time-scale adaptation.
arXiv Detail & Related papers (2021-10-26T19:44:53Z) - Better Fine-Tuning by Reducing Representational Collapse [77.44854918334232]
Existing approaches for fine-tuning pre-trained language models have been shown to be unstable.
We present a method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise.
We show it is less prone to representational collapse; the pre-trained models maintain more generalizable representations each time they are fine-tuned.
arXiv Detail & Related papers (2020-08-06T02:13:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.