From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics
- URL: http://arxiv.org/abs/2510.06954v1
- Date: Wed, 08 Oct 2025 12:37:53 GMT
- Title: From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics
- Authors: Zheng-An Chen, Tao Luo
- Abstract summary: We use the gradient flow analytical framework to systematically investigate linearized Transformer training dynamics. Our theoretical analysis dissects the dynamics of attention modules into two distinct stages.
- Score: 3.247992990696076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although transformer-based models have shown exceptional empirical performance, the fundamental principles governing their training dynamics are inadequately characterized beyond configuration-specific studies. Inspired by empirical evidence showing improved reasoning capabilities under small initialization scales in language models, we employ the gradient flow analytical framework established in [Zhou et al. NeurIPS 2022] to systematically investigate linearized Transformer training dynamics. Our theoretical analysis dissects the dynamics of attention modules into two distinct stages. In the first stage, asymmetric weight perturbations from random initialization sustain non-degenerate gradient dynamics in parameter matrices, facilitating systematic escape from small initialization regimes. Subsequently, these matrices undergo condensation, progressively aligning toward the target orientation. In the second stage, the previously static key-query matrices actively participate in training, driving the normalized matrices toward asymptotic rank collapse. This two-stage framework generalizes classical directional convergence results.
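The two-stage behavior described in the abstract can be illustrated numerically. The sketch below is a deliberately simplified setting chosen for illustration, not the paper's exact model or proof setup: a single linearized attention layer with the value path fixed to the identity, so only the query and key matrices are trained, by gradient descent from a small random initialization on a realizable rank-one teacher. It tracks the alignment of the effective key-query matrix A = W_Q W_K^T with the target (condensation) and the singular values of A / ||A||_F (rank collapse of the normalized matrix). All dimensions, scalings, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, n = 6, 10, 32                        # embedding dim, sequence length, samples
lr, steps, init_scale = 5e-3, 6000, 1e-2   # illustrative assumptions

X = rng.standard_normal((n, T, d))
A_star = np.outer(rng.standard_normal(d), rng.standard_normal(d))
A_star /= np.linalg.norm(A_star)           # rank-one target key-query matrix

def attend(X, A):
    """Linearized attention with the value path fixed to identity: (X A X^T / T) X."""
    S = np.einsum('nsd,de,nte->nst', X, A, X) / T
    return np.einsum('nst,ntd->nsd', S, X)

Y = attend(X, A_star)                      # realizable teacher outputs
WQ = init_scale * rng.standard_normal((d, d))
WK = init_scale * rng.standard_normal((d, d))
C = np.einsum('ntd,nte->nde', X, X)        # per-sample Gram matrices X_i^T X_i

for step in range(steps + 1):
    A = WQ @ WK.T                          # effective key-query matrix
    P = attend(X, A)
    G = (P - Y) / n                        # grad of 0.5 * mean squared error wrt P
    if step % 1000 == 0:
        An = A / (np.linalg.norm(A) + 1e-12)
        sv = np.linalg.svd(An, compute_uv=False)
        print(f"step {step:5d}  loss {0.5 * np.sum((P - Y) ** 2) / n:.3e}  "
              f"<A/||A||, A*> = {np.sum(An * A_star):+.3f}  "
              f"top-2 sing. vals of A/||A||: {sv[:2].round(3)}")
    gA = np.einsum('ntd,nte,nef->df', X, G, C) / T   # grad wrt A = WQ WK^T
    gWQ, gWK = gA @ WK, gA.T @ WQ                    # chain rule through the factors
    WQ -= lr * gWQ
    WK -= lr * gWK
```

In this toy run one expects an initial near-flat phase while the weights escape the small-initialization regime, after which the alignment approaches one and the second singular value of the normalized key-query matrix decays toward zero.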
Related papers
- ODELoRA: Training Low-Rank Adaptation by Solving Ordinary Differential Equations [54.886931928255564]
Low-rank adaptation (LoRA) has emerged as a widely adopted parameter-efficient fine-tuning method in deep transfer learning. We propose a novel continuous-time optimization dynamic for LoRA factor matrices in the form of an ordinary differential equation (ODE). We show that ODELoRA achieves stable feature learning, a property that is crucial for training deep neural networks at different scales of problem dimensionality.
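The listed abstract does not spell out the ODE formulation, so the following is only a generic sketch of the idea (an assumption, not ODELoRA's actual method): treat the LoRA factor matrices of a single frozen linear layer as the state of the gradient-flow ODE d(A, B)/dt = -grad L(W0 + B A) and integrate it with a classical Runge-Kutta step on a toy fine-tuning objective. All sizes, data, and the step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r, n = 20, 10, 2, 200
X = rng.standard_normal((n, d_in))
W0 = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)          # frozen base weights
dW_star = np.outer(rng.standard_normal(d_out),
                   rng.standard_normal(d_in)) / np.sqrt(d_in)    # low-rank target shift
Y = X @ (W0 + dW_star).T                                         # fine-tuning targets

B = np.zeros((d_out, r))                                         # usual LoRA init: B = 0
A = rng.standard_normal((r, d_in)) / np.sqrt(d_in)

def grads(A, B):
    """Gradients of 0.5 * mean squared error wrt the LoRA factors A and B."""
    E = X @ (W0 + B @ A).T - Y                                   # residuals, shape (n, d_out)
    gW = E.T @ X / n                                             # grad wrt the full update B @ A
    return B.T @ gW, gW @ A.T                                    # (grad wrt A, grad wrt B)

h = 0.2                                                          # ODE step size (illustrative)
for step in range(1001):
    if step % 250 == 0:
        loss = 0.5 * np.mean((X @ (W0 + B @ A).T - Y) ** 2)
        print(f"t = {step * h:6.1f}   loss = {loss:.4e}")
    # One classical RK4 step of the gradient-flow ODE d(A, B)/dt = -grad L(A, B).
    k1A, k1B = grads(A, B)
    k2A, k2B = grads(A - 0.5 * h * k1A, B - 0.5 * h * k1B)
    k3A, k3B = grads(A - 0.5 * h * k2A, B - 0.5 * h * k2B)
    k4A, k4B = grads(A - h * k3A, B - h * k3B)
    A -= h / 6 * (k1A + 2 * k2A + 2 * k3A + k4A)
    B -= h / 6 * (k1B + 2 * k2B + 2 * k3B + k4B)
```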
arXiv Detail & Related papers (2026-02-07T10:19:36Z) - Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics [29.85277126753054]
Cross-entropy (CE) loss dominates deep learning, yet existing theory often relies on simplifications. We provide an in-depth characterization of a canonical network with standard neural-basis vectors.
arXiv Detail & Related papers (2025-12-03T17:45:09Z) - Understanding Post-Training Structural Changes in Large Language Models [3.054513120350576]
Post-training fundamentally alters the behavior of large language models (LLMs). This work focuses on two widely adopted post-training methods: instruction tuning and long-chain-of-thought (Long-CoT) distillation.
arXiv Detail & Related papers (2025-09-22T15:03:36Z) - Exact Learning Dynamics of In-Context Learning in Linear Transformers and Its Application to Non-Linear Transformers [1.7034813545878589]
Transformer models exhibit remarkable in-context learning (ICL). Our work offers an exact dynamical model for ICL and theoretically grounded tools for analyzing complex transformer training.
arXiv Detail & Related papers (2025-04-17T13:05:33Z) - In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z) - Training Dynamics of In-Context Learning in Linear Attention [6.663503238373593]
We study the gradient descent dynamics of multi-head linear self-attention trained for in-context linear regression. We provide a theoretical description of how ICL abilities evolve during gradient descent training of linear attention.
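As a concrete reference point for this line of work, the sketch below trains the standard merged single-matrix parameterization of linear self-attention on in-context linear regression (an illustrative simplification, not the paper's multi-head model). Each prompt carries a fresh task vector w, context pairs (x_i, y_i), and a query; the model predicts y_q = x_q^T M ((1/N) sum_i y_i x_i), and M is trained by gradient descent across prompts. Sizes and the learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, batch, lr, steps = 5, 20, 256, 0.5, 2000

M = np.zeros((d, d))                                     # merged attention parameter
for step in range(steps + 1):
    # Fresh in-context regression prompts: y_i = <w, x_i>, one task w per prompt.
    W = rng.standard_normal((batch, d))
    Xc = rng.standard_normal((batch, N, d))
    Yc = np.einsum('bnd,bd->bn', Xc, W)                  # context labels
    xq = rng.standard_normal((batch, d))                 # query inputs
    yq = np.einsum('bd,bd->b', xq, W)                    # query labels
    h = np.einsum('bn,bnd->bd', Yc, Xc) / N              # (1/N) * sum_i y_i x_i
    pred = np.einsum('bd,de,be->b', xq, M, h)            # linear-attention prediction
    err = pred - yq
    if step % 500 == 0:
        off_id = np.linalg.norm(M - np.trace(M) / d * np.eye(d))
        print(f"step {step:5d}  ICL mse = {np.mean(err ** 2):.3f}  "
              f"||M - (tr M / d) I||_F = {off_id:.3f}")
    gM = np.einsum('b,bd,be->de', err, xq, h) / batch    # grad of 0.5 * mean squared error
    M -= lr * gM
```

With isotropic Gaussian inputs, M is driven toward a multiple of the identity, i.e. one step of least-squares-like regression performed in context; since each prompt has only N examples, the mse plateaus at the finite-context error rather than at zero.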
arXiv Detail & Related papers (2025-01-27T18:03:00Z) - Stability properties of gradient flow dynamics for the symmetric low-rank matrix factorization problem [22.648448759446907]
Low-rank matrix factorization serves as a building block in many learning tasks.
We offer new insight into the shape of the trajectories associated with the local-search part of the dynamics.
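A minimal version of the problem, stated here as an assumed setup for illustration only: gradient flow on f(X) = 1/4 ||X X^T - M||_F^2 for a symmetric PSD rank-r target M, discretized with a small step size, tracking the loss and the alignment of the factor with the top eigenspace of M. Sizes, step size, and initialization are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, eta, steps = 30, 3, 1e-2, 6000

U = np.linalg.qr(rng.standard_normal((n, r)))[0]       # target eigenvectors
M = U @ np.diag([3.0, 2.0, 1.0]) @ U.T                 # symmetric PSD target of rank r

X = 1e-2 * rng.standard_normal((n, r))                 # small random initialization
for step in range(steps + 1):
    R = X @ X.T - M                                    # residual
    if step % 1000 == 0:
        align = np.linalg.norm(U.T @ X) / (np.linalg.norm(X) + 1e-12)
        print(f"step {step:5d}  loss {0.25 * np.sum(R ** 2):.3e}  "
              f"fraction of X in top eigenspace {align:.3f}")
    X -= eta * (R @ X)                                 # gradient step on f(X) = 1/4 ||X X^T - M||_F^2
```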
arXiv Detail & Related papers (2024-11-24T20:05:10Z) - Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis [97.54180451650122]
We study the dynamics of training a shallow transformer on a task of recognizing co-occurrence of two designated words.
We analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear layer.
We prove a novel property of the gradient flow, termed *automatic balancing of gradients*, which enables the loss values of different samples to decrease almost at the same rate and further facilitates the proof of near-minimum training loss.
arXiv Detail & Related papers (2024-10-12T17:50:58Z) - Non-asymptotic Convergence of Training Transformers for Next-token Prediction [48.9399496805422]
Transformers have achieved extraordinary success in modern machine learning due to their excellent ability to handle sequential data.
This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer.
We show that the trained transformer presents non-trivial prediction ability under dataset shift.
arXiv Detail & Related papers (2024-09-25T20:22:06Z) - Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing [74.2952487120137]
It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in machine learning models.
This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem.
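For concreteness, the sketch below runs gradient descent on the overparametrized factorization X = U U^T for a toy matrix sensing instance (assumed setup, not the paper's exact setting): recover a low-rank PSD matrix X* from noiseless linear measurements y_i = <A_i, X*> starting from a small initialization, printing the singular values of U U^T, whose low-rank components are typically picked up sequentially. Sizes, the measurement model, and the step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d, r, m, eta, steps = 15, 2, 600, 0.05, 3000

V = np.linalg.qr(rng.standard_normal((d, r)))[0]
X_star = V @ np.diag([2.0, 0.5]) @ V.T                       # rank-2 PSD ground truth
A = rng.standard_normal((m, d, d))
A = (A + A.transpose(0, 2, 1)) / 2                           # symmetric sensing matrices
y = np.einsum('mij,ij->m', A, X_star)                        # measurements <A_i, X*>

U = 1e-3 * rng.standard_normal((d, d))                       # overparametrized factor, X = U U^T
for step in range(steps + 1):
    Xh = U @ U.T
    res = np.einsum('mij,ij->m', A, Xh) - y
    if step % 300 == 0:
        sv = np.linalg.svd(Xh, compute_uv=False)
        print(f"step {step:5d}  loss {0.5 * np.mean(res ** 2):.3e}  "
              f"top-3 singular values of U U^T: {sv[:3].round(3)}")
    G = np.einsum('m,mij->ij', res, A) / m                   # grad wrt X = U U^T
    U -= eta * (G + G.T) @ U                                 # chain rule: dX = dU U^T + U dU^T
```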
arXiv Detail & Related papers (2023-01-27T02:30:51Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
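A standard toy example of this kernel-to-rich transition (chosen here as an assumption; the paper treats more general models) is the diagonal linear network w = u*u - v*v on underdetermined sparse regression: a large initialization scale behaves like the kernel (minimum-l2) regime, while a small scale biases gradient descent toward sparse, minimum-l1-like solutions. Sizes, scales, and step counts below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, k = 100, 40, 3                                  # overparametrized: d >> n, sparse truth
X = rng.standard_normal((n, d)) / np.sqrt(n)
w_star = np.zeros(d)
w_star[:k] = 1.0                                      # k-sparse ground truth
y = X @ w_star

def train(alpha, lr=0.01, steps=50000):
    """GD on the diagonal linear network w = u*u - v*v from init u = v = alpha."""
    u = np.full(d, alpha)
    v = np.full(d, alpha)
    for _ in range(steps):
        g = X.T @ (X @ (u * u - v * v) - y)            # grad wrt the effective weights w
        u, v = u - 2 * lr * g * u, v + 2 * lr * g * v  # chain rule through u*u and v*v
    return u * u - v * v

for alpha in (1.0, 0.01):                              # kernel-like vs rich regime
    w = train(alpha)
    print(f"alpha = {alpha:5.2f}   train residual = {np.linalg.norm(X @ w - y):.1e}   "
          f"||w||_1 = {np.abs(w).sum():.2f}   ||w - w_star||_2 = {np.linalg.norm(w - w_star):.3f}")
```

Both runs interpolate the training data; the contrast shows up in the l1 norm and in the distance to the sparse ground truth, which are expected to be markedly smaller for the small initialization.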
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.