Dynamic metastability in the self-attention model
- URL: http://arxiv.org/abs/2410.06833v1
- Date: Wed, 9 Oct 2024 12:50:50 GMT
- Title: Dynamic metastability in the self-attention model
- Authors: Borjan Geshkovski, Hugo Koubbi, Yury Polyanskiy, Philippe Rigollet
- Abstract summary: We consider the self-attention model - an interacting particle system on the unit sphere - which serves as a toy model for Transformers.
We prove the appearance of dynamic metastability conjectured in [GLPR23].
We show that under an appropriate time-rescaling, the energy reaches its global maximum in finite time and has a staircase profile.
- Score: 22.689695473655906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the self-attention model - an interacting particle system on the unit sphere, which serves as a toy model for Transformers, the deep neural network architecture behind the recent successes of large language models. We prove the appearance of dynamic metastability conjectured in [GLPR23] - although particles collapse to a single cluster in infinite time, they remain trapped near a configuration of several clusters for an exponentially long period of time. By leveraging a gradient flow interpretation of the system, we also connect our result to an overarching framework of slow motion of gradient flows proposed by Otto and Reznikoff [OR07] in the context of coarsening and the Allen-Cahn equation. We finally probe the dynamics beyond the exponentially long period of metastability, and illustrate that, under an appropriate time-rescaling, the energy reaches its global maximum in finite time and has a staircase profile, with trajectories manifesting saddle-to-saddle-like behavior, reminiscent of recent works in the analysis of training dynamics via gradient descent for two-layer neural networks.
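As a rough illustration of the model in the abstract, here is a minimal numerical sketch of the self-attention particle system on the unit sphere, assuming the continuous-time dynamics in which each particle follows the tangential projection of a softmax-weighted average of all particles (as in [GLPR23]), together with the interaction energy $E_\beta = \frac{1}{2\beta n^2}\sum_{i,j} e^{\beta\langle x_i, x_j\rangle}$. The explicit Euler integrator, the renormalization step, and the constants below are illustrative choices, not the authors' setup.

```python
import numpy as np

def project_tangent(x, v):
    """Project each row of v onto the tangent space of the unit sphere at the matching row of x."""
    return v - np.sum(x * v, axis=1, keepdims=True) * x

def attention_velocity(x, beta):
    """Right-hand side of the self-attention dynamics: each particle moves toward a
    softmax-weighted average of all particles, projected onto the sphere's tangent space."""
    gram = x @ x.T                        # pairwise inner products <x_i, x_j>
    w = np.exp(beta * gram)               # attention weights
    w /= w.sum(axis=1, keepdims=True)     # row-normalize (softmax)
    return project_tangent(x, w @ x)

def interaction_energy(x, beta):
    """E_beta = (1 / (2 beta n^2)) * sum_{i,j} exp(beta <x_i, x_j>)."""
    n = x.shape[0]
    return np.exp(beta * (x @ x.T)).sum() / (2 * beta * n**2)

rng = np.random.default_rng(0)
n, d, beta, dt, steps = 32, 3, 9.0, 1e-2, 200_000

x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)       # random initial configuration on the sphere

for t in range(steps):
    x = x + dt * attention_velocity(x, beta)
    x /= np.linalg.norm(x, axis=1, keepdims=True)    # retract back onto the sphere
    if t % 20_000 == 0:
        print(f"t = {t * dt:9.1f}   energy = {interaction_energy(x, beta):.6f}")
```

For a moderately large beta, the particles quickly freeze into a few clusters and the printed energy sits on a plateau for a very long stretch before jumping upward, which is the metastable, staircase-like behavior described above.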
Related papers
- Latent Space Energy-based Neural ODEs [73.01344439786524]
This paper introduces a novel family of deep dynamical models designed to represent continuous-time sequence data.
We train the model using maximum likelihood estimation with Markov chain Monte Carlo.
Experiments on oscillating systems, videos and real-world state sequences (MuJoCo) illustrate that ODEs with the learnable energy-based prior outperform existing counterparts.
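For background, the Markov chain Monte Carlo step used when maximizing the likelihood of a model with an energy-based prior is usually a short run of Langevin dynamics in latent space. The sketch below shows only that ingredient; the toy quadratic energy, step size, and number of steps are placeholder choices, not the architecture of the cited paper.

```python
import numpy as np

def langevin_sample_prior(grad_energy, z0, step=0.1, n_steps=60, rng=None):
    """Short-run Langevin dynamics targeting p(z) proportional to exp(-E(z)) * N(z; 0, I):
    z <- z - (step / 2) * grad[E(z) + |z|^2 / 2] + sqrt(step) * noise."""
    rng = rng or np.random.default_rng()
    z = z0.copy()
    for _ in range(n_steps):
        drift = grad_energy(z) + z            # gradient of the energy plus the Gaussian base term
        z = z - 0.5 * step * drift + np.sqrt(step) * rng.standard_normal(z.shape)
    return z

# Toy quadratic energy E(z) = 0.5 * z^T A z with a placeholder diagonal matrix A.
A = np.diag([2.0, 0.5])
samples = langevin_sample_prior(lambda z: z @ A, z0=np.zeros((512, 2)),
                                rng=np.random.default_rng(1))
print("sample mean:", samples.mean(axis=0), " sample variance:", samples.var(axis=0))
```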
arXiv Detail & Related papers (2024-09-05T18:14:22Z) - Annealing Dynamics of Regular Rotor Networks: Universality and Its Breakdown [0.0]
The spin-vector Langevin (SVL) model has been proposed and tested as an alternative to the spin-vector Monte Carlo model.
We study the nonequilibrium dynamics of classical O(2) rotors on regular graphs.
Our results establish a universal breakdown of the Kibble-Zurek mechanism in classical systems characterized by long-range interactions.
arXiv Detail & Related papers (2024-07-12T14:55:25Z) - Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
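To make the distribution-dependent drift concrete, here is a minimal finite-particle, time-discretized sketch of a mean-field Langevin system in which each particle's drift depends on the empirical distribution of all particles. The potentials V and W, the entropy weight, and the step size are illustrative and are not the functional analyzed in the cited paper.

```python
import numpy as np

def grad_V(x):
    """Confining potential V(x) = |x|^2 / 2."""
    return x

def grad_W(diff):
    """Interaction potential W(d) = |d|^2 / 2 (a smooth attractive interaction)."""
    return diff

def mfld_step(x, dt, lam, rng):
    """One Euler-Maruyama step: the drift of particle i involves an average over all
    other particles, i.e. the empirical-measure (distribution-dependent) part of MFLD."""
    diffs = x[:, None, :] - x[None, :, :]            # pairwise differences x_i - x_j
    interaction = grad_W(diffs).mean(axis=1)         # (1/n) * sum_j grad W(x_i - x_j)
    drift = -(grad_V(x) + interaction)
    return x + dt * drift + np.sqrt(2 * lam * dt) * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 2)) * 3.0              # 256 particles in R^2
for _ in range(2_000):
    x = mfld_step(x, dt=1e-2, lam=0.1, rng=rng)
print("empirical mean:", x.mean(axis=0), " empirical variance:", x.var(axis=0))
```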
arXiv Detail & Related papers (2023-06-12T16:28:11Z) - Spreading of a local excitation in a Quantum Hierarchical Model [62.997667081978825]
We study the dynamics of the quantum Dyson hierarchical model in its paramagnetic phase.
An initial state made by a local excitation of the paramagnetic ground state is considered.
A localization mechanism is found and the excitation remains close to its initial position at arbitrary times.
arXiv Detail & Related papers (2022-07-14T10:05:20Z) - Convex Analysis of the Mean Field Langevin Dynamics [49.66486092259375]
A convergence rate analysis of the mean field Langevin dynamics is presented.
The proximal Gibbs distribution $p_q$ associated with the dynamics allows us to develop a convergence theory parallel to classical results in convex optimization.
arXiv Detail & Related papers (2022-01-25T17:13:56Z) - Predicting Physics in Mesh-reduced Space with Temporal Attention [15.054026802351146]
We propose a new method that captures long-term dependencies through a transformer-style temporal attention model.
Our method outperforms a competitive GNN baseline on several complex fluid dynamics prediction tasks.
We believe our approach paves the way to bringing the benefits of attention-based sequence models to solving high-dimensional complex physics tasks.
arXiv Detail & Related papers (2022-01-22T18:32:54Z) - The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion [29.489737359897312]
We study the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD).
We show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space.
arXiv Detail & Related papers (2021-07-19T20:18:57Z) - Learning Continuous System Dynamics from Irregularly-Sampled Partial Observations [33.63818978256567]
We present LG-ODE, a latent ordinary differential equation generative model for modeling multi-agent dynamic systems with known graph structure.
It can simultaneously learn the embedding of high-dimensional trajectories and infer continuous latent system dynamics.
Our model employs a novel encoder parameterized by a graph neural network that can infer initial states in an unsupervised way.
arXiv Detail & Related papers (2020-11-08T01:02:22Z) - Continuous-in-Depth Neural Networks [107.47887213490134]
We first show that ResNets fail to be meaningful dynamical integrators in this richer sense.
We then demonstrate that neural network models can learn to represent continuous dynamical systems.
We introduce ContinuousNet as a continuous-in-depth generalization of ResNet architectures.
arXiv Detail & Related papers (2020-08-05T22:54:09Z) - Liquid Time-constant Networks [117.57116214802504]
We introduce a new class of time-continuous recurrent neural network models.
Instead of declaring a learning system's dynamics by implicit nonlinearities, we construct networks of linear first-order dynamical systems.
These neural networks exhibit stable and bounded behavior and yield superior expressivity within the family of neural ordinary differential equations.
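To give a concrete picture of a network built from linear first-order dynamical systems modulated by a nonlinear gate, here is a minimal sketch of a single liquid time-constant cell using a fused Euler update; the weight initialization, dimensions, and input signal are illustrative choices rather than the cited paper's implementation.

```python
import numpy as np

class LTCCell:
    """Sketch of a liquid time-constant cell, assuming the form
    dx/dt = -(1/tau + f(x, u)) * x + f(x, u) * A, with f a bounded nonlinearity."""

    def __init__(self, n_in, n_hidden, tau=1.0, A=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_hidden, n_hidden)) * 0.1   # recurrent weights
        self.U = rng.standard_normal((n_hidden, n_in)) * 0.1       # input weights
        self.b = np.zeros(n_hidden)
        self.tau, self.A = tau, A

    def f(self, x, u):
        """Nonlinear gate that modulates the effective time constant."""
        return np.tanh(self.W @ x + self.U @ u + self.b)

    def step(self, x, u, dt=0.05):
        """Fused (semi-implicit) Euler update of the linear first-order ODE."""
        g = self.f(x, u)
        return (x + dt * g * self.A) / (1.0 + dt * (1.0 / self.tau + g))

cell = LTCCell(n_in=3, n_hidden=8)
x = np.zeros(8)
for t in range(100):
    u = np.sin(np.array([0.1, 0.2, 0.3]) * t)    # toy input signal
    x = cell.step(x, u)
print("final hidden state:", x)
```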
arXiv Detail & Related papers (2020-06-08T09:53:35Z) - Semiclassical dynamics of a disordered two-dimensional Hubbard model with long-range interactions [0.0]
We analyze quench dynamics in a two-dimensional system of interacting fermions.
For a weak and moderate disorder strength, we observe subdiffusive behavior of charges, while spins exhibit diffusive dynamics.
In contrast to the short-range model, strong inhomogeneities such as domain walls in the initial state can significantly slow down thermalization dynamics.
arXiv Detail & Related papers (2020-02-13T14:59:23Z)