Geometric Dynamics of Signal Propagation Predict Trainability of
Transformers
- URL: http://arxiv.org/abs/2403.02579v1
- Date: Tue, 5 Mar 2024 01:30:34 GMT
- Title: Geometric Dynamics of Signal Propagation Predict Trainability of
Transformers
- Authors: Aditya Cowsik, Tamra Nebabu, Xiao-Liang Qi, Surya Ganguli
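- Abstract summary: We investigate forward signal propagation and gradient back propagation in deep, randomly initialized transformers.
Our approach treats the evolution of the representations of $n$ tokens as they propagate through the transformer layers in terms of a discrete time dynamical system of $n$ interacting particles.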
We show through experiments that, remarkably, the final test loss at the end of training is well predicted just by these two exponents.
- Score: 22.25628914395565
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We investigate forward signal propagation and gradient back propagation in
deep, randomly initialized transformers, yielding simple necessary and
sufficient conditions on initialization hyperparameters that ensure
trainability of deep transformers. Our approach treats the evolution of the
representations of $n$ tokens as they propagate through the transformer layers
in terms of a discrete time dynamical system of $n$ interacting particles. We
derive simple update equations for the evolving geometry of this particle
system, starting from a permutation symmetric simplex. Our update equations
show that without MLP layers, this system will collapse to a line, consistent
with prior work on rank collapse in transformers. However, unlike prior work,
our evolution equations can quantitatively track particle geometry in the
additional presence of nonlinear MLP layers, and they reveal an order-chaos
phase transition as a function of initialization hyperparameters, like the
strength of attentional and MLP residual connections and weight variances. In
the ordered phase the particles are attractive and collapse to a line, while in
the chaotic phase the particles are repulsive and converge to a regular
$n$-simplex. We analytically derive two Lyapunov exponents: an angle exponent
that governs departures from the edge of chaos in this particle system, and a
gradient exponent that governs the rate of exponential growth or decay of
backpropagated gradients. We show through experiments that, remarkably, the
final test loss at the end of training is well predicted just by these two
exponents at the beginning of training, and that the simultaneous vanishing of
these two exponents yields a simple necessary and sufficient condition to
achieve minimal test loss.
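The collapse described in the abstract for the attention-only case can be illustrated with a small toy simulation. The sketch below is not the paper's update equations or code. It assumes a single-head softmax attention layer with a residual connection of strength `alpha`, fresh Gaussian weights of variance 1/d in every layer, and a rescaling of each token to norm sqrt(d) after every layer. All function names and default values are hypothetical choices made for illustration. It tracks the mean pairwise cosine similarity between tokens, which should drift toward 1 as the tokens align along a line.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def gaussian_matrix(rows, cols, std, rng):
    """Matrix with i.i.d. zero-mean Gaussian entries of standard deviation std."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def normalize_rows(x):
    """Rescale every token (row) to Euclidean norm sqrt(d)."""
    d = len(x[0])
    out = []
    for row in x:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v * math.sqrt(d) / norm for v in row])
    return out


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def mean_cosine(x):
    """Average cosine similarity over all distinct token pairs."""
    n = len(x)
    norms = [math.sqrt(sum(v * v for v in row)) for row in x]
    sims = [
        sum(a * b for a, b in zip(x[i], x[j])) / (norms[i] * norms[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(sims) / len(sims)


def attention_only_trajectory(n=4, d=32, depth=30, alpha=1.0, seed=0):
    """Propagate n random tokens through attention-only layers (no MLP).

    Returns the mean pairwise cosine similarity at depth 0, 1, ..., depth.
    Assumed toy setup: new random weights per layer, residual strength alpha.
    """
    rng = random.Random(seed)
    x = normalize_rows(gaussian_matrix(n, d, 1.0, rng))
    history = [mean_cosine(x)]
    std = 1.0 / math.sqrt(d)
    for _ in range(depth):
        wq, wk, wv = (gaussian_matrix(d, d, std, rng) for _ in range(3))
        q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
        k_t = [list(col) for col in zip(*k)]
        logits = matmul(q, k_t)
        attn = [softmax([s / math.sqrt(d) for s in row]) for row in logits]
        update = matmul(attn, v)
        # Residual connection followed by per-token rescaling.
        x = [[xi + alpha * ui for xi, ui in zip(xr, ur)]
             for xr, ur in zip(x, update)]
        x = normalize_rows(x)
        history.append(mean_cosine(x))
    return history


if __name__ == "__main__":
    hist = attention_only_trajectory()
    print(f"mean cosine at depth 0: {hist[0]:.3f}")
    print(f"mean cosine at depth {len(hist) - 1}: {hist[-1]:.3f}")
```

Because each attention row is a convex average over tokens, the component shared by all tokens grows faster than their differences, which is why the similarity rises with depth in this toy setting. Reproducing the chaotic phase would require adding nonlinear MLP layers with suitable hyperparameters, which this sketch deliberately leaves out.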
Related papers
- In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
- Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
- Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
arXiv Detail & Related papers (2023-06-06T09:12:49Z)
- Simulating scalar field theories on quantum computers with limited resources [62.997667081978825]
We present a quantum algorithm for implementing $\phi^4$ lattice scalar field theory on qubit computers.
The algorithm allows efficient $\phi^4$ state preparation for a large range of input parameters in both the normal and broken symmetry phases.
arXiv Detail & Related papers (2022-10-14T17:28:15Z)
- A DeepParticle method for learning and generating aggregation patterns in multi-dimensional Keller-Segel chemotaxis systems [3.6184545598911724]
We study a regularized interacting particle method for computing aggregation patterns and near singular solutions of a Keller-Segel (KS) chemotaxis system in two and three space dimensions.
We further develop the DeepParticle (DP) method to learn and generate solutions under variations of physical parameters.
arXiv Detail & Related papers (2022-08-31T20:52:01Z)
- On optimization of coherent and incoherent controls for two-level quantum systems [77.34726150561087]
This article considers some control problems for closed and open two-level quantum systems.
The closed system's dynamics is governed by the Schrödinger equation with coherent control.
The open system's dynamics is governed by the Gorini-Kossakowski-Sudarshan-Lindblad master equation.
arXiv Detail & Related papers (2022-05-05T09:08:03Z)
- A Score-based Geometric Model for Molecular Dynamics Simulations [33.158796937777886]
We propose a novel model called ScoreMD to estimate the gradient of the log density of molecular conformations.
With multiple architectural improvements, we outperform state-of-the-art baselines on MD17 and isomers of C7O2H10.
This research provides new insights into the acceleration of new material and drug discovery.
arXiv Detail & Related papers (2022-04-19T05:13:46Z)
- The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion [29.489737359897312]
We study the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD).
We show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space.
arXiv Detail & Related papers (2021-07-19T20:18:57Z)
- Spectral Analysis of Product Formulas for Quantum Simulation [0.0]
We show that the Trotter step size needed to estimate an energy eigenvalue within precision $\epsilon$ can be improved in scaling from $\epsilon$ to $\epsilon^{1/2}$ for a large class of systems.
Results partially generalize to diabatic processes, which remain in a narrow energy band separated from the rest of the spectrum by a gap.
arXiv Detail & Related papers (2021-02-25T03:17:25Z)
- Discrete truncated Wigner approach to dynamical phase transitions in Ising models after a quantum quench [0.0]
We study dynamical phase transitions arising in the steady state of transverse-field Ising models after a quantum quench.
We find identical exponents for $\alpha \lesssim 0.5$, suggesting that the dynamical transitions in this regime fall into the same universality class as the nonergodic mean-field limit.
arXiv Detail & Related papers (2020-04-21T08:20:15Z)
- On Layer Normalization in the Transformer Architecture [112.40350994368741]
We first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters.
We show in experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines.
arXiv Detail & Related papers (2020-02-12T00:33:03Z)