Disentangling Feature Structure: A Mathematically Provable Two-Stage Training Dynamics in Transformers
- URL: http://arxiv.org/abs/2502.20681v1
- Date: Fri, 28 Feb 2025 03:27:24 GMT
- Title: Disentangling Feature Structure: A Mathematically Provable Two-Stage Training Dynamics in Transformers
- Authors: Zixuan Gong, Jiaye Teng, Yong Liu,
- Abstract summary: This paper theoretically demonstrates how such two-stage training dynamics occur in transformers.<n>To our best known, this is the first rigorous result regarding a two-stage optimization process in transformers.
- Score: 14.775485895006323
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers may exhibit two-stage training dynamics during the real-world training process. For instance, when training GPT-2 on the Counterfact dataset, the answers progress from syntactically incorrect to syntactically correct to semantically correct. However, existing theoretical analyses hardly account for this two-stage phenomenon. In this paper, we theoretically demonstrate how such two-stage training dynamics occur in transformers. Specifically, we analyze the dynamics of transformers using feature learning techniques under in-context learning regimes, based on a disentangled two-type feature structure. Such disentanglement of feature structure is general in practice, e.g., natural languages contain syntax and semantics, and proteins contain primary and secondary structures. To our best known, this is the first rigorous result regarding a two-stage optimization process in transformers. Additionally, a corollary indicates that such a two-stage process is closely related to the spectral properties of the attention weights, which accords well with empirical findings.
Related papers
- Exact Learning Dynamics of In-Context Learning in Linear Transformers and Its Application to Non-Linear Transformers [1.7034813545878589]
Transformer models exhibit remarkable in-context learning (ICL)
Our work offers an exact dynamical model for ICL and theoretically grounded tools for analyzing complex transformer training.
arXiv Detail & Related papers (2025-04-17T13:05:33Z) - Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z) - How Transformers Get Rich: Approximation and Dynamics Analysis [11.789846138681359]
We provide both approximation and dynamics analyses of how transformers implement induction heads.<n>In the em approximation analysis, we formalize both standard and generalized induction head mechanisms.<n>For the em dynamics analysis, we study the training dynamics on a synthetic mixed target, composed of a 4-gram and an in-context 2-gram component.
arXiv Detail & Related papers (2024-10-15T10:22:27Z) - Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis [97.54180451650122]
We study the dynamics of training a shallow transformer on a task of recognizing co-occurrence of two designated words.
We analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear layer.
We prove a novel property of the gradient flow, termed textitautomatic balancing of gradients, which enables the loss values of different samples to decrease almost at the same rate and further facilitates the proof of near minimum training loss.
arXiv Detail & Related papers (2024-10-12T17:50:58Z) - Non-asymptotic Convergence of Training Transformers for Next-token Prediction [48.9399496805422]
Transformers have achieved extraordinary success in modern machine learning due to their excellent ability to handle sequential data.
This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer.
We show that the trained transformer presents non-token prediction ability with dataset shift.
arXiv Detail & Related papers (2024-09-25T20:22:06Z) - Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations [75.14793516745374]
We propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training.
Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking.
Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token.
arXiv Detail & Related papers (2024-07-05T14:29:44Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers i.e. learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.