Dyson Brownian motion and random matrix dynamics of weight matrices during learning
- URL: http://arxiv.org/abs/2411.13512v1
- Date: Wed, 20 Nov 2024 18:05:39 GMT
- Title: Dyson Brownian motion and random matrix dynamics of weight matrices during learning
- Authors: Gert Aarts, Ouraman Hajizadeh, Biagio Lucini, Chanju Park
- Abstract summary: We first demonstrate that the dynamics can generically be described using Dyson Brownian motion.
The level of stochasticity is shown to depend on the ratio of the learning rate and the mini-batch size.
We then study weight matrix dynamics in transformers, following the evolution from a Marchenko-Pastur distribution for eigenvalues at initialisation to a combination with additional structure at the end of learning.
- Score: 0.0
- Abstract: During training, weight matrices in machine learning architectures are updated using stochastic gradient descent or variations thereof. In this contribution we employ concepts of random matrix theory to analyse the resulting stochastic matrix dynamics. We first demonstrate that the dynamics can generically be described using Dyson Brownian motion, leading to e.g. eigenvalue repulsion. The level of stochasticity is shown to depend on the ratio of the learning rate and the mini-batch size, explaining the empirically observed linear scaling rule. We verify this linear scaling in the restricted Boltzmann machine. Subsequently we study weight matrix dynamics in transformers (a nano-GPT), following the evolution from a Marchenko-Pastur distribution for eigenvalues at initialisation to a combination with additional structure at the end of learning.
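To make the picture concrete, here is a minimal numerical sketch (ours, not the authors' code) of Dyson Brownian motion for eigenvalues: the pairwise drift term produces the eigenvalue repulsion mentioned in the abstract, and the noise amplitude is tied to a hypothetical learning_rate / batch_size ratio as a stand-in for the stochasticity level discussed above.

```python
import numpy as np

def dyson_step(lam, dt, noise_scale, rng):
    """One Euler-Maruyama step of Dyson Brownian motion for eigenvalues.

    The repulsive drift sum_{j != i} 1/(lam_i - lam_j) pushes eigenvalues
    apart; noise_scale is used here as a stand-in for the
    learning-rate / batch-size ratio discussed in the abstract.
    """
    n = lam.size
    diff = lam[:, None] - lam[None, :]       # pairwise differences
    np.fill_diagonal(diff, np.inf)           # drop the i == j terms
    drift = (1.0 / diff).sum(axis=1) / n
    noise = rng.normal(size=n) * np.sqrt(2.0 * noise_scale * dt)
    return lam + drift * dt + noise

rng = np.random.default_rng(0)
learning_rate, batch_size = 1e-3, 32         # hypothetical values
noise_scale = learning_rate / batch_size     # ratio controlling stochasticity
lam = np.sort(rng.normal(size=50))           # toy initial spectrum
for _ in range(1000):
    lam = dyson_step(lam, dt=1e-3, noise_scale=noise_scale, rng=rng)
print("smallest level spacing:", np.diff(np.sort(lam)).min())
```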
Related papers
- Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models [76.52307406752556]
We derive a novel deterministic equivalence for the two-point function of a random resolvent.
We give a unified derivation of the performance of a wide variety of high-dimensional trained linear models with gradient descent.
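For orientation, the resolvent in question is the standard one (our notation; the paper's precise two-point object may differ):

```latex
% Resolvent of a (random) matrix X at spectral parameter z:
G(z) = (X - z\,\mathbb{1})^{-1}
% A "two-point function" involves a product of two resolvents, e.g.
\frac{1}{n}\operatorname{tr}\big[ A\, G(z_1)\, B\, G(z_2) \big],
% whose deterministic equivalent replaces the random expression by a
% deterministic function of z_1, z_2 in the high-dimensional limit.
```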
arXiv Detail & Related papers (2025-02-07T16:45:40Z) - Fokker-Planck to Callan-Symanzik: evolution of weight matrices under training [9.257985820123]
We use the Fokker-Planck equation to simulate the evolution of the probability density of individual weight matrices in the bottleneck layers of a simple two-bottleneck-layer auto-encoder.
We also derive physically relevant partial differential equations, such as the Callan-Symanzik and Kardar-Parisi-Zhang equations, from our dynamical equation.
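Schematically, a Fokker-Planck equation for the weight density takes the following form (generic textbook form in our own notation, not necessarily the paper's exact equation):

```latex
% Schematic Fokker-Planck equation for the weight density p(W,t):
% drift from the loss gradient, diffusion from mini-batch noise,
% with learning rate \eta, batch size B, gradient-noise covariance \Sigma.
\partial_t\, p(W,t)
  = \nabla_W \!\cdot\! \big[\, p(W,t)\, \nabla_W L(W) \,\big]
  + \frac{\eta}{2B}\, \nabla_W \nabla_W : \big[\, \Sigma(W)\, p(W,t) \,\big]
```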
arXiv Detail & Related papers (2025-01-16T16:54:40Z) - Random Matrix Theory for Stochastic Gradient Descent [0.0]
Investigating the dynamics of learning in machine learning algorithms is of paramount importance for understanding how and why an approach may be successful.
Here we apply concepts from random matrix theory to describe weight matrix dynamics, using the framework of Dyson Brownian motion.
We derive the linear scaling rule between the learning rate (step size) and the batch size, and identify universal and non-universal aspects of weight matrix dynamics.
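The standard back-of-the-envelope argument behind the linear scaling rule goes as follows (our sketch, not quoted from the paper): one SGD step carries mini-batch noise with covariance proportional to eta^2 / B, so the effective diffusion scales as eta / B.

```latex
% One SGD step with mini-batch gradient noise \xi_t, Cov(\xi_t) = \Sigma(\theta)/B:
\theta_{t+1} = \theta_t - \eta\, \nabla L(\theta_t) + \eta\, \xi_t,
\qquad
\operatorname{Cov}(\eta\, \xi_t) = \frac{\eta^2}{B}\, \Sigma(\theta)
  = \eta \cdot \frac{\eta}{B}\, \Sigma(\theta)
% i.e. a diffusion coefficient proportional to \eta / B: rescaling the batch
% size together with the learning rate leaves the dynamics unchanged.
```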
arXiv Detail & Related papers (2024-12-29T15:21:13Z) - Latent Space Energy-based Neural ODEs [73.01344439786524]
This paper introduces novel deep dynamical models designed to represent continuous-time sequences.
We train the model using maximum likelihood estimation with Markov chain Monte Carlo.
Experimental results on oscillating systems, videos and real-world state sequences (MuJoCo) demonstrate that our model with the learnable energy-based prior outperforms existing counterparts.
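The MCMC ingredient typically used in such training is short-run Langevin sampling of the latent energy-based prior; a minimal sketch under our own assumptions (`grad_energy` is a hypothetical placeholder for the learned energy's gradient):

```python
import numpy as np

def langevin_sample(grad_energy, z0, step=0.01, n_steps=100, rng=None):
    """Short-run Langevin MCMC: z <- z - step * dE/dz + sqrt(2*step) * noise."""
    rng = rng or np.random.default_rng(0)
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        z = z - step * grad_energy(z) + np.sqrt(2.0 * step) * rng.normal(size=z.shape)
    return z

# Toy energy E(z) = ||z||^2 / 2 (gradient is z): samples approach N(0, I).
z = langevin_sample(lambda z: z, z0=np.zeros(8))
```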
arXiv Detail & Related papers (2024-09-05T18:14:22Z) - Stochastic weight matrix dynamics during learning and Dyson Brownian motion [0.0]
We demonstrate that the update of weight matrices in learning algorithms can be described in the framework of Dyson Brownian motion.
We discuss universal and non-universal features in the gas distribution and identify the Wigner surmise and Wigner semicircle explicitly in a teacher-student model.
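For reference, the two universal objects named here are given by the standard formulas (Gaussian orthogonal ensemble conventions):

```latex
% Wigner surmise for the normalized nearest-neighbour level spacing s (beta = 1):
P(s) = \frac{\pi s}{2}\, e^{-\pi s^2 / 4}
% Wigner semicircle for the eigenvalue density (entry variance \sigma^2 / n):
\rho(\lambda) = \frac{1}{2\pi\sigma^2}\, \sqrt{4\sigma^2 - \lambda^2},
\qquad |\lambda| \le 2\sigma
```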
arXiv Detail & Related papers (2024-07-23T12:25:50Z) - Quantum trajectory entanglement in various unravelings of Markovian dynamics [0.0]
The cost of classical simulations of quantum many-body dynamics is often determined by the amount of entanglement in the system.
We study entanglement in quantum trajectory approaches that solve master equations describing open quantum system dynamics.
arXiv Detail & Related papers (2024-04-18T13:19:26Z) - Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore the parallels between machine learning dynamics and physical systems in and out of equilibrium.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching, sketched below.
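A minimal sketch of how without-replacement minibatching slots into an SGLD update (our toy code; `grad_loss`, `data`, and `temperature` are hypothetical placeholders, not the paper's implementation):

```python
import numpy as np

def sgld_epoch(theta, grad_loss, data, lr, temperature, batch_size, rng):
    """One epoch of SGLD with WITHOUT-replacement minibatching:
    shuffle once, then sweep through disjoint batches, adding
    Gaussian noise whose scale is set by the temperature."""
    idx = rng.permutation(len(data))            # each sample used exactly once
    for start in range(0, len(idx), batch_size):
        batch = data[idx[start:start + batch_size]]
        noise = rng.normal(size=theta.shape) * np.sqrt(2.0 * lr * temperature)
        theta = theta - lr * grad_loss(theta, batch) + noise
    return theta

# Toy usage: quadratic loss around the batch mean.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, size=(256, 1))
grad = lambda th, b: th - b.mean(axis=0)        # gradient of ||th - mean||^2 / 2
theta = np.zeros(1)
for _ in range(50):
    theta = sgld_epoch(theta, grad, data, lr=0.1, temperature=0.01,
                       batch_size=32, rng=rng)
```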
arXiv Detail & Related papers (2023-06-06T09:12:49Z) - Probabilistic Unrolling: Scalable, Inverse-Free Maximum Likelihood Estimation for Latent Gaussian Models [69.22568644711113]
We introduce probabilistic unrolling, a method that combines Monte Carlo sampling with iterative linear solvers to circumvent matrix inversions.
Our theoretical analyses reveal that unrolling and backpropagation through the iterations of the solver can accelerate gradient estimation for maximum likelihood estimation.
In experiments on simulated and real data, we demonstrate that probabilistic unrolling learns latent Gaussian models up to an order of magnitude faster than gradient EM, with minimal losses in model performance.
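The inverse-free ingredient in miniature: wherever a latent Gaussian model computation would require K^{-1} b, an iterative solver can stand in for the explicit inverse. A generic conjugate-gradient toy (scipy; not the paper's probabilistic-unrolling code):

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 200))
K = A @ A.T + 200.0 * np.eye(200)    # symmetric positive definite "covariance"
b = rng.normal(size=200)

x, info = cg(K, b, maxiter=200)      # iterative solve, no explicit inverse
print("converged:", info == 0, "residual:", np.linalg.norm(K @ x - b))
```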
arXiv Detail & Related papers (2023-06-05T21:08:34Z) - Capturing dynamical correlations using implicit neural representations [85.66456606776552]
We develop an artificial intelligence framework which combines a neural network trained to mimic simulated data from a model Hamiltonian with automatic differentiation to recover unknown parameters from experimental data.
In doing so, we illustrate the ability to build and train a differentiable model only once, which can then be applied in real time to multi-dimensional scattering data.
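A toy version of the differentiate-through-a-surrogate loop (jax for the automatic differentiation; `surrogate` is an explicit formula standing in for the trained network, and all names and values are made up):

```python
import jax
import jax.numpy as jnp

def surrogate(params, q):
    """Stand-in for the trained network: maps physical parameters and
    momentum transfer q to a predicted scattering intensity."""
    J, gamma = params
    return J * jnp.cos(q) + gamma * q**2

q = jnp.linspace(0.0, 1.0, 64)
target = surrogate(jnp.array([1.5, 0.3]), q)   # pretend "experimental" data

loss = lambda p: jnp.mean((surrogate(p, q) - target) ** 2)
grad_loss = jax.grad(loss)

p = jnp.array([1.0, 0.0])                      # initial guess
for _ in range(500):                           # plain gradient descent
    p = p - 0.1 * grad_loss(p)
print(p)                                       # approaches [1.5, 0.3]
```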
arXiv Detail & Related papers (2023-04-08T07:55:36Z) - Graph Polynomial Convolution Models for Node Classification of Non-Homophilous Graphs [52.52570805621925]
We investigate efficient learning from higher-order graph convolutions and learning directly from the adjacency matrix for node classification.
We show that the resulting model leads to new graph representations and a residual scaling parameter.
We demonstrate that the proposed methods obtain improved accuracy for node classification of non-homophilous graphs.
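Stripped of training details, the core operation of a polynomial graph convolution is a filter of the form sum_k c_k A^k X; a generic sketch (not necessarily the authors' exact parameterisation):

```python
import numpy as np

def poly_graph_conv(A, X, coeffs):
    """Polynomial graph filter sum_k coeffs[k] * A^k @ X, computed
    iteratively so no matrix power is ever formed explicitly."""
    out = coeffs[0] * X
    AkX = X
    for c in coeffs[1:]:
        AkX = A @ AkX                # apply one more power of A to the features
        out = out + c * AkX
    return out

# Toy graph: a 4-node ring with 2 features per node, order-2 filter.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 2))
H = poly_graph_conv(A, X, coeffs=[0.5, 0.3, 0.2])
```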
arXiv Detail & Related papers (2022-09-12T04:46:55Z) - Fluctuation-dissipation Type Theorem in Stochastic Linear Learning [2.8292841621378844]
The fluctuation-dissipation theorem (FDT) is a simple yet powerful consequence of the first-order differential equation governing the dynamics of systems subject simultaneously to dissipative and stochastic forces.
The linear learning dynamics, in which the input vector maps to the output vector through a linear matrix whose elements are the subject of learning, has a stochastic version closely mimicking Langevin dynamics when the full-batch gradient descent scheme is replaced by stochastic gradient descent. We derive a generalized fluctuation-dissipation relation for the stochastic linear learning dynamics and verify its validity on well-known machine learning data sets such as MNIST and CIFAR-10; the classical relation it generalises is sketched below.
arXiv Detail & Related papers (2021-06-04T02:54:26Z)
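For context, the textbook relation that a fluctuation-dissipation type theorem generalises, stated for the Langevin equation (standard physics, our notation):

```latex
% Langevin equation with friction \gamma and white noise \xi(t):
m\, \dot{v} = -\gamma\, v + \xi(t),
\qquad
\langle \xi(t)\, \xi(t') \rangle = 2D\, \delta(t - t')
% The fluctuation-dissipation theorem ties the noise strength to the
% dissipation and the equilibrium temperature:
D = \gamma\, k_B T
```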