Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization
- URL: http://arxiv.org/abs/2510.08341v1
- Date: Thu, 09 Oct 2025 15:26:48 GMT
- Title: Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization
- Authors: Pál Zsámboki, Benjamin Levi, David Ansel Josef Smith, Mitansh Kagalwala, Arlington Kell, Samuel Liechty, Cong Wang
- Abstract summary: We study length generalization in transformers through the set complement task. A model must predict a uniform distribution over tokens absent from an input sequence. We show that if such a model achieves balanced logit displacement at lengths 1 and 2, then it must generalize to longer sequences.
- Score: 3.8776893257232032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study length generalization in transformers through the set complement task, where a model must predict a uniform distribution over tokens absent from an input sequence -- an ability central to board-game style reasoning. Our main theoretical result establishes two statements. First, we prove tight bounds on embedding and value dimensions for single-layer attention-only transformers. Second, we show that if such a model achieves balanced logit displacement at lengths 1 and 2, then it must generalize to longer sequences, though with reduced precision. A mechanistic reading of the proof explains this limitation: as more tokens are attended to, softmax compresses logit displacements, eroding separation between valid and invalid outputs. Training dynamics also suggest a second obstacle: when many next tokens are possible, updates become noisy. We hypothesize that dropout can counteract the first effect and Exponential Moving Average (EMA) the second. We validate these hypotheses through random hyperparameter search on the set complement task, which confirms both mechanisms. We then test OthelloGPT, a GPT-1 style model trained on random Othello moves, and find that EMA again improves length generalization in this more complex setting.
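The compression mechanism described in the abstract can be illustrated numerically. The sketch below is our own illustration, not code from the paper: it shows that, with a fixed logit displacement between a distinguished token and its competitors, the softmax mass the model can place on that token shrinks as more tokens enter the softmax, eroding the separation between valid and invalid outputs.

```python
import math

def softmax_mass_on_target(displacement, n_attended):
    """Softmax weight on one token whose logit exceeds the other
    n_attended - 1 logits by `displacement` (all competitors tied)."""
    return math.exp(displacement) / (math.exp(displacement) + (n_attended - 1))

# With the displacement held fixed, the weight on the distinguished token
# decays roughly like 1/n as the attended context grows.
for n in (2, 8, 32, 128):
    print(n, round(softmax_mass_on_target(2.0, n), 4))
```

Running this prints a strictly decreasing sequence of weights, which is exactly the "reduced precision at longer lengths" behavior the theorem predicts.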
Related papers
- Spilled Energy in Large Language Models [3.434649016649368]
We reinterpret the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM). This principled approach allows us to track "energy spills" during decoding, which we empirically show correlate with factual errors, biases, and failures.
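One way to make the EBM reading concrete is the standard convention that logits are negative energies, so the log-partition function gives the distribution's free energy. This is a minimal sketch under that convention; the helper names are ours, and the paper's precise definition of an "energy spill" may differ.

```python
import math

def token_energy(logit):
    # EBM convention: each classifier logit is a negative energy.
    return -logit

def free_energy(logits):
    # -logsumexp(logits): the free energy of the softmax distribution,
    # computed with the usual max-shift for numerical stability.
    m = max(logits)
    return -(m + math.log(sum(math.exp(l - m) for l in logits)))
```

A sudden rise in free energy between decoding steps would be one candidate signal for a spill, since it means no token is assigned a low energy.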
arXiv Detail & Related papers (2026-02-21T00:38:47Z)
- Generation Order and Parallel Decoding in Masked Diffusion Models: An Information-Theoretic Perspective [16.942478643768144]
Masked Diffusion Models (MDMs) significantly accelerate inference by trading off sequential determinism. We provide a unified information-theoretic framework to decouple and analyze two fundamental sources of failure: order sensitivity and parallelization bias.
arXiv Detail & Related papers (2026-01-30T20:15:18Z)
- Quantitative Bounds for Length Generalization in Transformers [58.175107357008876]
We study the problem of length generalization (LG) in transformers. LG occurs when the internal behavior of the transformer on longer sequences can be "simulated" by its behavior on shorter sequences.
arXiv Detail & Related papers (2025-10-30T21:31:36Z)
- Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation [8.973965016201822]
Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to instability. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and gradients.
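Both failure modes leave a signature in the entropy of each softmaxed attention row, which suggests a simple diagnostic. The sketch below is our illustration, not the paper's method: entropy near the maximum log(n) accompanies the near-uniform mixing typical of rank collapse, while entropy near zero indicates the concentrated scores of entropy collapse.

```python
import math

def attention_entropy(scores):
    # Softmax one row of attention scores, then return its Shannon entropy.
    # ~log(n): near-uniform attention (symptomatic of rank collapse).
    # ~0: attention locked onto a single token (entropy collapse).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return -sum((e / z) * math.log(e / z) for e in exps)

uniform_h = attention_entropy([0.0, 0.0, 0.0, 0.0])  # equals log(4)
peaked_h = attention_entropy([50.0, 0.0, 0.0, 0.0])  # vanishingly small
```

Tracking this quantity per head during early training is a cheap way to see whether an initialisation is drifting toward either failure mode.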
arXiv Detail & Related papers (2025-05-30T08:18:23Z)
- Born a Transformer -- Always a Transformer? [57.37263095476691]
We study a family of $\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al. We observe an $\textit{induction-versus-anti-induction}$ asymmetry, where pretrained models are better at retrieving tokens to the right (induction) than the left (anti-induction) of a query token. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained transformers.
arXiv Detail & Related papers (2025-05-27T21:36:50Z)
- Mechanistic Insights into Grokking from the Embedding Layer [15.676058752772287]
Grokking, a delayed generalization in neural networks, has been observed in Transformers and MLPs, but the components driving it remain underexplored. We show that embeddings are central to grokking: introducing them into MLPs induces delayed generalization in modular arithmetic tasks. Our methods not only improve grokking dynamics but also extend to broader challenges in Transformer optimization, where bilinear interactions hinder efficient training.
arXiv Detail & Related papers (2025-05-21T15:12:34Z)
- Beyond Progress Measures: Theoretical Insights into the Mechanism of Grokking [50.465604300990904]
Grokking refers to the abrupt improvement in test accuracy after extended overfitting. In this work, we investigate the grokking mechanism underlying the Transformer in the task of prime number operations.
arXiv Detail & Related papers (2025-04-04T04:42:38Z)
- An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models [64.87562101662952]
We show that input tokens are often exchangeable since they already include positional encodings.
We establish the existence of a sufficient and minimal representation of input tokens.
We prove that attention with the desired parameter infers the latent posterior up to an approximation error.
arXiv Detail & Related papers (2022-12-30T17:59:01Z)
- WR-ONE2SET: Towards Well-Calibrated Keyphrase Generation [57.11538133231843]
Keyphrase generation aims to automatically generate short phrases summarizing an input document.
The recently emerged ONE2SET paradigm generates keyphrases as a set and has achieved competitive performance.
We propose WR-ONE2SET which extends ONE2SET with an adaptive instance-level cost Weighting strategy and a target Re-assignment mechanism.
arXiv Detail & Related papers (2022-11-13T09:56:24Z)
- Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that resolves both problems by endowing each attention head with a mixed-membership Stochastic Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.