Related papers: Geometric Dynamics of Signal Propagation Predict Trainability of Transformers

URL: http://arxiv.org/abs/2403.02579v1
Date: Tue, 5 Mar 2024 01:30:34 GMT
Title: Geometric Dynamics of Signal Propagation Predict Trainability of Transformers
Authors: Aditya Cowsik, Tamra Nebabu, Xiao-Liang Qi, Surya Ganguli
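Abstract summary: We investigate forward signal propagation and gradient back propagation in deep, randomly initialized transformers. Our approach treats the evolution of the representations of $n$ tokens as they propagate through the transformer layers as a discrete-time dynamical system of $n$ interacting particles. We show through experiments that, remarkably, the final test loss at the end of training is well predicted just by two Lyapunov exponents, an angle exponent and a gradient exponent, at the beginning of training.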
Score: 22.25628914395565
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We investigate forward signal propagation and gradient back propagation in deep, randomly initialized transformers, yielding simple necessary and sufficient conditions on initialization hyperparameters that ensure trainability of deep transformers. Our approach treats the evolution of the representations of $n$ tokens as they propagate through the transformer layers in terms of a discrete time dynamical system of $n$ interacting particles. We derive simple update equations for the evolving geometry of this particle system, starting from a permutation symmetric simplex. Our update equations show that without MLP layers, this system will collapse to a line, consistent with prior work on rank collapse in transformers. However, unlike prior work, our evolution equations can quantitatively track particle geometry in the additional presence of nonlinear MLP layers, and it reveals an order-chaos phase transition as a function of initialization hyperparameters, like the strength of attentional and MLP residual connections and weight variances. In the ordered phase the particles are attractive and collapse to a line, while in the chaotic phase the particles are repulsive and converge to a regular $n$-simplex. We analytically derive two Lyapunov exponents: an angle exponent that governs departures from the edge of chaos in this particle system, and a gradient exponent that governs the rate of exponential growth or decay of backpropagated gradients. We show through experiments that, remarkably, the final test loss at the end of training is well predicted just by these two exponents at the beginning of training, and that the simultaneous vanishing of these two exponents yields a simple necessary and sufficient condition to achieve minimal test loss.
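
The collapse "to a line" described in the abstract for attention-only stacks can be illustrated numerically. The sketch below is a toy model and not the paper's update equations: the residual strength `alpha`, the query/key scale `sigma_qk`, the absence of layer normalization, and the use of mean pairwise cosine similarity as a collapse measure are all assumptions made for illustration.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with max subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mean_pairwise_cosine(x):
    """Average cosine similarity over all distinct token pairs."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    gram = unit @ unit.T
    n = x.shape[0]
    return float(gram[~np.eye(n, dtype=bool)].mean())


def propagate(n_tokens=8, dim=64, depth=20, alpha=1.0, sigma_qk=0.1, seed=0):
    """Push random tokens through random attention-only residual layers.

    Hypothetical toy setup: fresh Gaussian weights per layer, no MLP,
    no layer normalization. Returns the mean pairwise cosine per layer.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_tokens, dim))
    history = [mean_pairwise_cosine(x)]
    for _ in range(depth):
        w_q = rng.normal(0.0, sigma_qk / np.sqrt(dim), (dim, dim))
        w_k = rng.normal(0.0, sigma_qk / np.sqrt(dim), (dim, dim))
        w_v = rng.normal(0.0, 1.0 / np.sqrt(dim), (dim, dim))
        attn = softmax((x @ w_q) @ (x @ w_k).T / np.sqrt(dim))
        # Residual attention update: every token receives a nearly identical
        # averaged vector, so the shared direction grows with depth.
        x = x + alpha * attn @ x @ w_v
        history.append(mean_pairwise_cosine(x))
    return history


history = propagate()
print(f"mean pairwise cosine: layer 0 = {history[0]:.3f}, "
      f"layer {len(history) - 1} = {history[-1]:.3f}")
```

In this toy setting the mean pairwise cosine moves toward 1 with depth, meaning the tokens align along a common direction. The chaotic phase described in the abstract requires nonlinear MLP layers, which this sketch deliberately omits.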

Related papers

Generalized Linear Mode Connectivity for Transformers [87.32299363530996]
A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths. Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope. We introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, transformations, and general invertible maps. This generalization enables, for the first time, the discovery of low- and zero-barrier linear paths between independently trained Vision Transformers and GPT-2 models.
arXiv Detail & Related papers (2025-06-28T01:46:36Z)
Mechanistic Insights into Grokking from the Embedding Layer [15.676058752772287]
Grokking, a delayed generalization in neural networks, has been observed in Transformers and MLPs, but the components driving it remain underexplored. We show that embeddings are central to grokking: introducing them into MLPs induces delayed generalization in modular arithmetic tasks. Our methods not only improve grokking dynamics but also extend to broader challenges in Transformer optimization, where bilinear interactions hinder efficient training.
arXiv Detail & Related papers (2025-05-21T15:12:34Z)
Digital quantum simulation of the Su-Schrieffer-Heeger model using a parameterized quantum circuit [1.4998308221771977]
We perform digital quantum simulations of the Su-Schrieffer-Heeger model using a parameterized quantum circuit. We investigate the evolution of the energy, entanglement entropy, and mutual information towards nontrivial ground states.
arXiv Detail & Related papers (2025-04-10T06:54:10Z)
A Unified Perspective on the Dynamics of Deep Transformers [24.094975798576783]
We study the evolution of data anisotropy through a deep Transformer. We highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.
arXiv Detail & Related papers (2025-01-30T13:04:54Z)
Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis [97.54180451650122]
We study the dynamics of training a shallow transformer on a task of recognizing co-occurrence of two designated words. We analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear layer. We prove a novel property of the gradient flow, termed automatic balancing of gradients, which enables the loss values of different samples to decrease almost at the same rate and further facilitates the proof of near minimum training loss.
arXiv Detail & Related papers (2024-10-12T17:50:58Z)
Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence of the gradient flow on the regression loss despite the non-convexity of the landscape. This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z)
Non-asymptotic Convergence of Training Transformers for Next-token Prediction [48.9399496805422]
Transformers have achieved extraordinary success in modern machine learning due to their excellent ability to handle sequential data. This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer. We show that the trained transformer presents non-trivial prediction ability with dataset shift.
arXiv Detail & Related papers (2024-09-25T20:22:06Z)
Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem. We characterize the implicit bias of 1-layer transformers optimized with gradient descent. We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels. We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium. We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
arXiv Detail & Related papers (2023-06-06T09:12:49Z)
On optimization of coherent and incoherent controls for two-level quantum systems [77.34726150561087]
This article considers some control problems for closed and open two-level quantum systems. The closed system's dynamics is governed by the Schrödinger equation with coherent control. The open system's dynamics is governed by the Gorini-Kossakowski-Sudarshan-Lindblad master equation.
arXiv Detail & Related papers (2022-05-05T09:08:03Z)
A Score-based Geometric Model for Molecular Dynamics Simulations [33.158796937777886]
We propose a novel model called ScoreMD to estimate the gradient of the log density of molecular conformations. With multiple architectural improvements, we outperform state-of-the-art baselines on MD17 and isomers of C7O2H10. This research provides new insights into the acceleration of new material and drug discovery.
arXiv Detail & Related papers (2022-04-19T05:13:46Z)
The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion [29.489737359897312]
We study the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). We show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space.
arXiv Detail & Related papers (2021-07-19T20:18:57Z)
Discrete truncated Wigner approach to dynamical phase transitions in Ising models after a quantum quench [0.0]
We study dynamical phase transitions arising in the steady state of transverse-field Ising models after a quantum quench. We find identical exponents for $\alpha \lesssim 0.5$, suggesting that the dynamical transitions in this regime fall into the same universality class as the nonergodic mean-field limit.
arXiv Detail & Related papers (2020-04-21T08:20:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.