What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers
- URL: http://arxiv.org/abs/2506.13688v1
- Date: Mon, 16 Jun 2025 16:51:18 GMT
- Title: What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers
- Authors: Pulkit Gopalani, Wei Hu
- Abstract summary: This work investigates the underlying mechanisms for such dynamics, primarily in shallow Transformers. We reveal that during the plateau, the model often develops an interpretable partial solution while simultaneously exhibiting a strong repetition bias in its outputs. We validate that these identified phenomena, repetition bias and representation collapse, are not artifacts of toy setups but also manifest in the early pre-training stage of large language models like Pythia and OLMo.
- Score: 9.575216516290237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training Transformers on algorithmic tasks frequently demonstrates an intriguing abrupt learning phenomenon: an extended performance plateau followed by a sudden, sharp improvement. This work investigates the underlying mechanisms for such dynamics, primarily in shallow Transformers. We reveal that during the plateau, the model often develops an interpretable partial solution while simultaneously exhibiting a strong repetition bias in its outputs. This output degeneracy is accompanied by internal representation collapse, where hidden states across different tokens become nearly parallel. We further identify the slow learning of optimal attention maps as a key bottleneck: hidden progress in attention configuration during the plateau precedes the eventual rapid convergence, and directly intervening on attention significantly alters plateau duration and the severity of repetition bias and representational collapse. We validate that these identified phenomena, repetition bias and representation collapse, are not artifacts of toy setups but also manifest in the early pre-training stage of large language models like Pythia and OLMo.
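As a rough illustration of the quantities the abstract refers to, the sketch below (not the paper's code; function names and the diagnostics themselves are illustrative assumptions) computes two simple measurements in plain PyTorch: a repeated-token rate over generated output (one crude signal of repetition bias) and the mean pairwise cosine similarity of hidden states across token positions (values near 1 indicate the nearly parallel states described as representation collapse).

```python
# Illustrative sketch only (not the paper's code): simple diagnostics for the
# repetition bias and representation collapse described in the abstract.
import torch
import torch.nn.functional as F

def repeated_token_rate(token_ids: torch.Tensor) -> float:
    """Fraction of positions whose token equals the previous token.

    token_ids: (seq_len,) integer tensor of generated token ids.
    """
    if token_ids.numel() < 2:
        return 0.0
    return (token_ids[1:] == token_ids[:-1]).float().mean().item()

def mean_pairwise_cosine(hidden: torch.Tensor) -> float:
    """Mean cosine similarity between hidden states at different positions.

    hidden: (seq_len, d_model) hidden states for one sequence.
    Values near 1.0 indicate nearly parallel (collapsed) representations.
    """
    h = F.normalize(hidden, dim=-1)                  # unit-normalize each token state
    sim = h @ h.T                                    # (seq_len, seq_len) cosine matrix
    mask = ~torch.eye(sim.shape[0], dtype=torch.bool)  # drop self-similarities
    return sim[mask].mean().item()

# Quick check: random states are near-orthogonal; copies of one state are parallel.
torch.manual_seed(0)
print(mean_pairwise_cosine(torch.randn(16, 64)))                        # ~0.0
print(mean_pairwise_cosine(torch.randn(1, 64).expand(16, 64).clone()))  # ~1.0
print(repeated_token_rate(torch.tensor([5, 5, 5, 7, 7, 2])))            # 0.6
```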
Related papers
- The emergence of sparse attention: impact of data distribution and benefits of repetition [14.652502263025882]
We study the emergence over training of sparse attention, a critical and frequently observed attention pattern in Transformers. By combining theoretical analysis of a toy model with empirical observations on small Transformers trained on a linear regression variant, we uncover the mechanics of sparse attention emergence. Our findings provide a simple, theoretically grounded framework for understanding how data distributions and model design influence the learning dynamics behind one form of emergence.
arXiv Detail & Related papers (2025-05-23T13:14:02Z) - New Evidence of the Two-Phase Learning Dynamics of Neural Networks [59.55028392232715]
We introduce an interval-wise perspective that compares network states across a time window. We show that the response of the network to a perturbation exhibits a transition from chaotic to stable. We also find that after this transition point the model's functional trajectory is confined to a narrow cone-shaped subset.
arXiv Detail & Related papers (2025-05-20T04:03:52Z) - On the Emergence of Position Bias in Transformers [59.87743433861665]
This paper presents a graph-theoretic framework for analyzing position biases in multilayer attention. Our framework offers a principled foundation for understanding positional interplay in transformers.
arXiv Detail & Related papers (2025-02-04T02:53:07Z) - Abrupt Learning in Transformers: A Case Study on Matrix Completion [15.210510215283882]
We formulate the low-rank matrix completion problem as a masked language modeling (MLM) task.
We show that it is possible to train a BERT model to solve this task to low error.
We also analyze the training dynamics of individual model components to understand the sudden drop in loss.
arXiv Detail & Related papers (2024-10-29T17:08:06Z) - Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [77.66717051042032]
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
arXiv Detail & Related papers (2024-10-17T17:54:06Z) - Uncovering Layer-Dependent Activation Sparsity Patterns in ReLU Transformers [2.1572258716881905]
We explore how token-level sparsity evolves over the course of training, and how it connects to broader sparsity patterns.
In particular, we demonstrate that the first and last layers of the network have distinctive and in many ways inverted relationships to sparsity.
We additionally explore the phenomenon of ReLU dimensions "turning off", and show evidence suggesting that "neuron death" is being driven by the dynamics of training.
arXiv Detail & Related papers (2024-07-10T17:10:10Z) - Diagnosing Catastrophe: Large parts of accuracy loss in continual
learning can be accounted for by readout misalignment [0.0]
Training artificial neural networks on changing data distributions leads to a rapid decrease in performance on old tasks.
We investigate the representational changes that underlie this performance decrease and identify three distinct processes that together account for the phenomenon.
arXiv Detail & Related papers (2023-10-09T11:57:46Z) - Stabilizing Transformer Training by Preventing Attention Entropy
Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - Exploring Transferable and Robust Adversarial Perturbation Generation
from the Perspective of Network Hierarchy [52.153866313879924]
The transferability and robustness of adversarial examples are two practical yet important properties for black-box adversarial attacks.
We propose a transferable and robust adversarial generation (TRAP) method.
Our TRAP achieves impressive transferability and high robustness against certain interferences.
arXiv Detail & Related papers (2021-08-16T11:52:41Z) - Dissecting Lottery Ticket Transformers: Structural and Behavioral Study
of Sparse Neural Machine Translation [0.0]
Recent work on the lottery ticket hypothesis has produced highly sparse Transformers for NMT while maintaining BLEU.
By probing Transformers with more and more low-magnitude weights pruned away, we find that complex semantic information is first to be degraded.
Analysis of internal activations reveals that higher layers diverge most over the course of pruning, gradually becoming less complex than their dense counterparts.
arXiv Detail & Related papers (2020-09-17T02:08:45Z) - Extreme Memorization via Scale of Initialization [72.78162454173803]
We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD.
We find that the extent and manner in which generalization ability is affected depends on the activation and loss function used.
In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function.
arXiv Detail & Related papers (2020-08-31T04:53:11Z)