Loss Landscape Degeneracy Drives Stagewise Development in Transformers
- URL: http://arxiv.org/abs/2402.02364v2
- Date: Thu, 13 Feb 2025 07:29:46 GMT
- Title: Loss Landscape Degeneracy Drives Stagewise Development in Transformers
- Authors: Jesse Hoogland, George Wang, Matthew Farrugia-Roberts, Liam Carroll, Susan Wei, Daniel Murfet
- Abstract summary: We show that training can be divided into distinct periods of change in loss landscape degeneracy.
This finding underscores the potential of a degeneracy-based perspective for understanding modern deep learning.
- Score: 1.947473271879451
- Abstract: Deep learning involves navigating a high-dimensional loss landscape over the neural network parameter space. Over the course of training, complex computational structures form and re-form inside the neural network, leading to shifts in input/output behavior. It is a priority for the science of deep learning to uncover principles governing the development of neural network structure and behavior. Drawing on the framework of singular learning theory, we propose that model development is deeply linked to degeneracy in the local geometry of the loss landscape. We investigate this link by monitoring loss landscape degeneracy throughout training, as quantified by the local learning coefficient, for a transformer language model and an in-context linear regression transformer. We show that training can be divided into distinct periods of change in loss landscape degeneracy, and that these changes in degeneracy coincide with significant changes in the internal computational structure and the input/output behavior of the transformers. This finding underscores the potential of a degeneracy-based perspective for understanding modern deep learning.
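The local learning coefficient referenced in the abstract is commonly estimated by sampling a tempered posterior localized around the current parameters (for example, with stochastic-gradient Langevin dynamics) and comparing the average sampled loss to the loss at the center. The sketch below is a minimal illustration of that style of estimator on a deliberately degenerate toy model; the hyperparameters, model, and data are placeholders rather than the paper's setup.

```python
import math
import torch

def estimate_llc(model, loss_fn, xs, ys, steps=1000, eps=1e-5, gamma=100.0, warmup=200):
    """SGLD-based sketch of a local learning coefficient (LLC) estimate.

    llc_hat = n * beta * (mean sampled loss - loss at the center w*),
    with inverse temperature beta = 1 / log(n). The sampler is kept near
    the starting parameters w* by a quadratic restraint of strength gamma.
    All hyperparameters here are illustrative, not tuned.
    """
    n = len(xs)
    beta = 1.0 / math.log(n)
    center = [p.detach().clone() for p in model.parameters()]
    loss_star = loss_fn(model(xs), ys).item()

    sampled_losses = []
    for step in range(steps):
        loss = loss_fn(model(xs), ys)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, c in zip(model.parameters(), center):
                drift = n * beta * p.grad + gamma * (p - c)
                p.add_(-0.5 * eps * drift + math.sqrt(eps) * torch.randn_like(p))
        if step >= warmup:
            sampled_losses.append(loss.item())

    return n * beta * (sum(sampled_losses) / len(sampled_losses) - loss_star)

# Toy usage: a deliberately degenerate model y = (a * b) * x fit to y = x.
torch.manual_seed(0)
xs = torch.linspace(-1, 1, 1000).unsqueeze(1)
ys = xs.clone()
model = torch.nn.Sequential(torch.nn.Linear(1, 1, bias=False),
                            torch.nn.Linear(1, 1, bias=False))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(500):  # train to (near) a minimum first
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(xs), ys)
    loss.backward()
    opt.step()

print("estimated LLC:", estimate_llc(model, torch.nn.functional.mse_loss, xs, ys))
```

In the paper's setting, the same quantity is tracked at checkpoints across training, so that plateaus and shifts in the estimate delineate developmental stages.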
Related papers
- Dynamics of Transient Structure in In-Context Linear Regression Transformers [0.5242869847419834]
We show that when transformers are trained on in-context linear regression tasks with intermediate task diversity, they behave like ridge regression before specializing to the tasks in their training distribution.
This transition from a general solution to a specialized solution is revealed by joint trajectory principal component analysis.
We empirically validate this explanation by measuring the model complexity of our transformers as defined by the local learning coefficient.
arXiv Detail & Related papers (2025-01-29T16:32:14Z)
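The joint trajectory principal component analysis mentioned above can be pictured as collecting a summary vector per training checkpoint (for example, the model's predictions on a fixed probe batch), stacking these rows, and projecting the trajectory onto its leading principal components; a "joint" analysis stacks checkpoints from several runs before the decomposition. A minimal sketch on synthetic stand-in checkpoints (the probe-batch interpretation and the data here are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def trajectory_pca(checkpoint_vectors, k=2):
    """Project a training trajectory onto its top-k principal components.

    checkpoint_vectors: array of shape (num_checkpoints, dim), one row per
    checkpoint (e.g., flattened predictions on a fixed probe batch).
    Returns (projected coordinates, explained variance ratio).
    """
    X = np.asarray(checkpoint_vectors, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)       # center across checkpoints
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ Vt[:k].T                         # coordinates in PC space
    evr = (S ** 2 / np.sum(S ** 2))[:k]
    return proj, evr

# Stand-in "checkpoints": a noisy, intrinsically low-dimensional trajectory.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)[:, None]
ckpts = np.hstack([np.sin(3 * t), t ** 2]) @ rng.normal(size=(2, 100)) \
        + 0.01 * rng.normal(size=(50, 100))
proj, evr = trajectory_pca(ckpts, k=2)
print("explained variance of PC1, PC2:", np.round(evr, 3))
```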
- Inverting Visual Representations with Detection Transformers [0.8124699127636158]
We apply the approach of training inverse models to reconstruct input images from intermediate layers within a Detection Transformer.
We demonstrate critical properties of Detection Transformers, including contextual shape robustness, inter-layer correlation, and preservation under color perturbations.
arXiv Detail & Related papers (2024-12-09T14:43:06Z)
- Dynamical stability and chaos in artificial neural network trajectories along training [3.379574469735166]
We study the dynamical properties of training by analyzing, through the lens of dynamical systems theory, the parameter trajectories of a shallow neural network.
We find hints of regular and chaotic behavior depending on the learning rate regime.
This work also contributes to the cross-fertilization of ideas between dynamical systems theory, network theory and machine learning.
arXiv Detail & Related papers (2024-04-08T17:33:11Z)
- Emergent learning in physical systems as feedback-based aging in a glassy landscape [0.0]
We show that the learning dynamics resembles an aging process, where the system relaxes in response to repeated application of the feedback boundary forces.
We also observe that the square root of the mean-squared error as a function of epoch takes on a non-exponential form, which is a typical feature of glassy systems.
arXiv Detail & Related papers (2023-09-08T15:24:55Z)
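Non-exponential relaxation of the kind reported in the aging paper above is often summarized by fitting a stretched-exponential (Kohlrausch) form to the error curve, with a stretching exponent below 1 as the glassy signature. The fit below runs on synthetic data; the functional form and parameters are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def stretched_exp(t, a, tau, beta, c):
    """Kohlrausch form: a * exp(-(t / tau)**beta) + c, with 0 < beta <= 1."""
    return a * np.exp(-(t / tau) ** beta) + c

# Synthetic "RMSE vs epoch" curve with a stretching exponent of 0.5.
epochs = np.arange(1, 500)
rmse = stretched_exp(epochs, a=1.0, tau=50.0, beta=0.5, c=0.05)
rmse += 0.005 * np.random.default_rng(0).normal(size=epochs.size)

popt, _ = curve_fit(stretched_exp, epochs, rmse, p0=[1.0, 30.0, 0.8, 0.0],
                    bounds=([0.0, 1e-3, 0.05, -1.0], [10.0, 1e4, 2.0, 1.0]))
print("fitted (a, tau, beta, c):", np.round(popt, 3))
# A fitted beta close to 1 would indicate simple exponential relaxation;
# beta well below 1 indicates the slower, glass-like relaxation discussed above.
```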
- Unsupervised Learning of Invariance Transformations [105.54048699217668]
We develop an algorithmic framework for finding approximate graph automorphisms.
We discuss how this framework can be used to find approximate automorphisms in weighted graphs in general.
arXiv Detail & Related papers (2023-07-24T17:03:28Z)
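For the invariance-transformations entry above, the underlying object is an approximate graph automorphism: a relabeling of nodes that nearly maps a weighted graph onto itself. The summary does not describe the search algorithm, so the sketch below only computes the mismatch such a framework would drive toward zero, ||A - P A P^T||, for a hypothetical permutation.

```python
import numpy as np

def automorphism_mismatch(adj, perm):
    """Frobenius mismatch between a weighted graph and its relabeling.

    adj:  (n, n) weighted adjacency matrix.
    perm: length-n permutation of node indices.
    Returns ||A - P A P^T||_F; zero means perm is an exact automorphism,
    small values mean it is an approximate one.
    """
    A = np.asarray(adj, dtype=float)
    P = np.eye(len(perm))[perm]          # permutation matrix
    return np.linalg.norm(A - P @ A @ P.T)

# A 4-cycle with one slightly perturbed edge weight.
A = np.array([[0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.05],
              [1.0, 0.0, 1.05, 0.0]])
rotate = [1, 2, 3, 0]                    # rotate the cycle by one node
print("mismatch under rotation:", round(automorphism_mismatch(A, rotate), 3))
```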
- Centered Self-Attention Layers [89.21791761168032]
The self-attention mechanism in transformers and the message-passing mechanism in graph neural networks are repeatedly applied.
We show that this repeated application inevitably leads to oversmoothing, i.e., to increasingly similar representations at deeper layers.
We present a correction term to the aggregating operator of these mechanisms.
arXiv Detail & Related papers (2023-06-02T15:19:08Z)
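The centered self-attention entry above states only that a correction term is added to the aggregating operator. As an illustrative stand-in, the sketch below centers the aggregation by subtracting the per-sequence mean of the value vectors, which removes the shared component that pushes token representations toward one another; the paper's actual correction term may differ.

```python
import torch
import torch.nn.functional as F

def centered_attention(q, k, v):
    """Single-head attention with an illustrative centering correction.

    Standard aggregation attn @ v mixes tokens toward their attention-weighted
    mean; subtracting the per-sequence mean of v from the values removes that
    shared component before aggregation. Shapes: (batch, tokens, dim).
    """
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1)
    v_centered = v - v.mean(dim=-2, keepdim=True)   # the centering correction
    return attn @ v_centered

x = torch.randn(2, 8, 16)
out = centered_attention(x, x, x)
print(out.shape)  # torch.Size([2, 8, 16])
```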
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
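The mesa-optimizer claim in the in-context learning entry above rests on constructions where a linear self-attention layer reproduces a step of gradient descent on an in-context least-squares problem. The check below verifies that equivalence numerically for the simplest hand-written case (one step from zero weights); it is a toy restatement, not the paper's construction for trained transformers.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 5, 20, 0.1
X = rng.normal(size=(n, d))          # in-context inputs x_1..x_n
y = X @ rng.normal(size=d)           # in-context targets
x_query = rng.normal(size=d)

# One gradient-descent step from w = 0 on the in-context squared loss
# L(w) = 0.5 * sum_i (w @ x_i - y_i)^2, followed by a prediction on the query.
w_after_one_step = lr * X.T @ y
pred_gd = w_after_one_step @ x_query

# The same prediction written as an (unnormalized) linear-attention readout:
# query = x_query, keys = x_i, values = lr * y_i; score is the dot product.
pred_attn = (X @ x_query) @ (lr * y)

print(np.isclose(pred_gd, pred_attn))  # True: one GD step == one linear-attention readout
```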
- Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Going beyond p-convolutions to learn grayscale morphological operators [64.38361575778237]
We present two new morphological layers based on the same principle as the p-convolutional layer.
arXiv Detail & Related papers (2021-02-19T17:22:16Z)
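For background on the last entry: the p-convolution it builds on is usually defined through the counter-harmonic mean, which sweeps from erosion-like to dilation-like behavior as the exponent p varies, and the paper's two new layers go beyond this construction. The sketch below implements only that baseline counter-harmonic filter, as an assumption about the relevant background rather than the paper's proposed layers.

```python
import numpy as np
from scipy.ndimage import convolve

def p_convolution(image, kernel, p):
    """Counter-harmonic-mean filter underlying the 'p-convolution'.

    (w * f^(p+1)) / (w * f^p): large positive p approximates grayscale
    dilation (a max over the window), large negative p approximates
    erosion (a min), and p = 0 reduces to an ordinary weighted average.
    Assumes a strictly positive image.
    """
    f = np.asarray(image, dtype=float)
    num = convolve(f ** (p + 1), kernel, mode="nearest")
    den = convolve(f ** p, kernel, mode="nearest")
    return num / den

img = np.array([[1.0, 1.0, 1.0],
                [1.0, 9.0, 1.0],
                [1.0, 1.0, 1.0]])
box = np.ones((3, 3))
print("p=+20 (approx. dilation):\n", np.round(p_convolution(img, box, 20), 2))
print("p=-20 (approx. erosion):\n", np.round(p_convolution(img, box, -20), 2))
```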
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.