The emergence of sparse attention: impact of data distribution and benefits of repetition
- URL: http://arxiv.org/abs/2505.17863v1
- Date: Fri, 23 May 2025 13:14:02 GMT
- Title: The emergence of sparse attention: impact of data distribution and benefits of repetition
- Authors: Nicolas Zucchet, Francesco d'Angelo, Andrew K. Lampinen, Stephanie C. Y. Chan,
- Abstract summary: We study the emergence over training of sparse attention, a critical and frequently observed attention pattern in Transformers. By combining theoretical analysis of a toy model with empirical observations on small Transformers trained on a linear regression variant, we uncover the mechanics driving sparse attention emergence. Our findings provide a simple, theoretically grounded framework for understanding how data distributions and model design influence the learning dynamics behind one form of emergence.
- Score: 14.652502263025882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emergence is a fascinating property of large language models and neural networks more broadly: as models scale and train for longer, they sometimes develop new abilities in sudden ways. Despite initial studies, we still lack a comprehensive understanding of how and when these abilities emerge. To address this gap, we study the emergence over training of sparse attention, a critical and frequently observed attention pattern in Transformers. By combining theoretical analysis of a toy model with empirical observations on small Transformers trained on a linear regression variant, we uncover the mechanics driving sparse attention emergence and reveal that emergence timing follows power laws based on task structure, architecture, and optimizer choice. We additionally find that repetition can greatly speed up emergence. Finally, we confirm these results on a well-studied in-context associative recall task. Our findings provide a simple, theoretically grounded framework for understanding how data distributions and model design influence the learning dynamics behind one form of emergence.
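To make the kind of experiment the abstract describes concrete, here is a minimal, hypothetical sketch in PyTorch: a single softmax-attention head trained on a recall-flavoured in-context regression toy, with attention entropy tracked as a proxy for sparsity. This is not the authors' code; the task details, dimensions, and the `repeat` option are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d, seq_len, batch = 8, 16, 256


def make_batch(repeat=1):
    # A toy recall-flavoured linear regression variant (an assumption, not
    # the paper's exact task): the context holds (x_i, y_i) pairs generated
    # by a per-sequence weight vector w, and the query copies one context x.
    # Predicting its y well requires attending sharply to the matching pair.
    w = torch.randn(batch, d, 1)
    x = torch.randn(batch, seq_len, d)
    y = x @ w                                             # (batch, seq_len, 1)
    ctx = torch.cat([x, y], dim=-1).repeat(1, repeat, 1)  # optional repetition
    idx = torch.randint(seq_len, (batch,))
    xq = x[torch.arange(batch), idx].unsqueeze(1)         # query copies one x_i
    yq = y[torch.arange(batch), idx]                      # (batch, 1) target
    q_tok = torch.cat([xq, torch.zeros(batch, 1, 1)], dim=-1)
    return ctx, q_tok, yq


class OneHeadAttention(torch.nn.Module):
    # A single softmax attention head: the smallest model in which a
    # sparse (near one-hot) attention pattern can emerge.
    def __init__(self, dim):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim, bias=False)
        self.k = torch.nn.Linear(dim, dim, bias=False)
        self.v = torch.nn.Linear(dim, 1, bias=False)

    def forward(self, ctx, q_tok):
        scores = self.q(q_tok) @ self.k(ctx).transpose(1, 2)
        att = torch.softmax(scores / ctx.shape[-1] ** 0.5, dim=-1)
        return (att @ self.v(ctx)).squeeze(-1), att       # prediction, weights


model = OneHeadAttention(d + 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2001):
    ctx, q_tok, yq = make_batch(repeat=1)  # try repeat=2: more copies of the
    pred, att = model(ctx, q_tok)          # relevant pair in context
    loss = ((pred - yq) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        # Attention entropy falling toward zero signals sparse attention.
        ent = -(att * att.clamp_min(1e-9).log()).sum(-1).mean()
        print(f"step {step:5d}  loss {loss.item():.4f}  attn entropy {ent.item():.3f}")
```

In this toy, a drop in attention entropy marks the emergence of sparse attention, and setting `repeat > 1` tiles the context so the relevant pair appears multiple times, loosely mimicking the repetition effect the abstract reports.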
Related papers
- Symbolic Foundation Regressor on Complex Networks [8.438758359789462]
We introduce a pre-trained symbolic foundation regressor that can compress complex data with numerous interacting variables. Our model has been rigorously tested on non-network symbolic regression, symbolic regression on complex networks, and the inference of network dynamics across various domains.
arXiv Detail & Related papers (2025-05-28T01:53:29Z)
- Implicit bias produces neural scaling laws in learning curves, from perceptrons to deep networks [11.365318749216739]
We study the entire training dynamics through the lens of spectral complexity norms. We identify two novel dynamical scaling laws that govern how performance evolves during training. Our findings are consistent across CNNs, ResNets, and Vision Transformers trained on MNIST, CIFAR-10 and CIFAR-100.
arXiv Detail & Related papers (2025-05-19T15:13:36Z)
- In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z)
- Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond [61.18736646013446]
In pursuit of a deeper understanding of its surprising behaviors, we investigate the utility of a simple yet accurate model of a trained neural network.
Across three case studies, we illustrate how it can be applied to derive new empirical insights on a diverse range of prominent phenomena.
arXiv Detail & Related papers (2024-10-31T22:54:34Z)
- Toward Understanding In-context vs. In-weight Learning [50.24035812301655]
We identify simplified distributional properties that give rise to the emergence and disappearance of in-context learning. We then extend the study to a full large language model, showing how fine-tuning on various collections of natural language prompts can elicit similar in-context and in-weight learning behaviour.
arXiv Detail & Related papers (2024-10-30T14:09:00Z)
- Cross-Entropy Is All You Need To Invert the Data Generating Process [29.94396019742267]
Empirical phenomena suggest that supervised models can learn interpretable factors of variation in a linear fashion. Recent advances in self-supervised learning have shown that these methods can recover latent structures by inverting the data generating process. We prove that even in standard classification tasks, models learn representations of ground-truth factors of variation up to a linear transformation.
arXiv Detail & Related papers (2024-10-29T09:03:57Z)
- Spin glass model of in-context learning [2.285821277711785]
We study a transformer with linear attention and map this structure to a spin glass model with real-valued spins. Our theory reveals that for single-instance learning, increasing the task diversity leads to the emergence of in-context learning. The proposed analytically tractable model thus offers a promising avenue for thinking about how to interpret many intriguing but puzzling properties of large language models.
arXiv Detail & Related papers (2024-08-05T07:54:01Z)
- Critical Learning Periods for Multisensory Integration in Deep Networks [112.40005682521638]
We show that the ability of a neural network to integrate information from diverse sources hinges critically on being exposed to properly correlated signals during the early phases of training.
We show that critical periods arise from the complex and unstable early transient dynamics, which are decisive of final performance of the trained system and their learned representations.
arXiv Detail & Related papers (2022-10-06T23:50:38Z)
- Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
- Leveraging the structure of dynamical systems for data-driven modeling [111.45324708884813]
We consider the impact of the training set and its structure on the quality of the long-term prediction.
We show how an informed design of the training set, based on invariants of the system and the structure of the underlying attractor, significantly improves the resulting models.
arXiv Detail & Related papers (2021-12-15T20:09:20Z)
- Deep learning of contagion dynamics on complex networks [0.0]
We propose a complementary approach based on deep learning to build effective models of contagion dynamics on networks.
By allowing simulations on arbitrary network structures, our approach makes it possible to explore the properties of the learned dynamics beyond the training data.
Our results demonstrate how deep learning offers a new and complementary perspective to build effective models of contagion dynamics on networks.
arXiv Detail & Related papers (2020-06-09T17:18:34Z)