Dyson Brownian motion and random matrix dynamics of weight matrices during learning
- URL: http://arxiv.org/abs/2411.13512v1
- Date: Wed, 20 Nov 2024 18:05:39 GMT
- Title: Dyson Brownian motion and random matrix dynamics of weight matrices during learning
- Authors: Gert Aarts, Ouraman Hajizadeh, Biagio Lucini, Chanju Park
- Abstract summary: We first demonstrate that the dynamics can generically be described using Dyson Brownian motion.
The level of stochasticity is shown to depend on the ratio of the learning rate and the mini-batch size.
We then study weight matrix dynamics in transformers following the evolution from a Marchenko-Pastur distribution for eigenvalues at initialisation to a combination with additional structure at the end of learning.
- Score: 0.0
- License:
- Abstract: During training, weight matrices in machine learning architectures are updated using stochastic gradient descent or variations thereof. In this contribution we employ concepts of random matrix theory to analyse the resulting stochastic matrix dynamics. We first demonstrate that the dynamics can generically be described using Dyson Brownian motion, leading to e.g. eigenvalue repulsion. The level of stochasticity is shown to depend on the ratio of the learning rate and the mini-batch size, explaining the empirically observed linear scaling rule. We verify this linear scaling in the restricted Boltzmann machine. Subsequently we study weight matrix dynamics in transformers (a nano-GPT), following the evolution from a Marchenko-Pastur distribution for eigenvalues at initialisation to a combination with additional structure at the end of learning.
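To make the initialisation picture concrete, here is a minimal numpy sketch (with assumed matrix sizes and variance, not the authors' code) that compares the eigenvalue density of a Gaussian-initialised weight matrix with the Marchenko-Pastur law:

```python
import numpy as np

# Illustrative sketch (assumed sizes, not the paper's code): compare the
# eigenvalue density of X = W^T W / N for a Gaussian-initialised weight
# matrix W with the Marchenko-Pastur law.
rng = np.random.default_rng(0)
N, M = 1000, 400                      # W is N x M, aspect ratio q = M / N
q, sigma2 = M / N, 1.0                # q <= 1, variance of the i.i.d. entries

W = rng.normal(0.0, np.sqrt(sigma2), size=(N, M))
eigvals = np.linalg.eigvalsh(W.T @ W / N)

# Marchenko-Pastur density, supported on [lam_minus, lam_plus]
lam_minus = sigma2 * (1 - np.sqrt(q)) ** 2
lam_plus = sigma2 * (1 + np.sqrt(q)) ** 2

hist, edges = np.histogram(eigvals, bins=40, density=True)
centres = 0.5 * (edges[1:] + edges[:-1])
mp = np.sqrt(np.clip((lam_plus - centres) * (centres - lam_minus), 0, None)) \
     / (2 * np.pi * sigma2 * q * centres)

print("bin centre | empirical | Marchenko-Pastur")
for c, h, m in zip(centres[::8], hist[::8], mp[::8]):
    print(f"{c:10.3f} | {h:9.3f} | {m:16.3f}")
```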
Related papers
- Truncated Gaussian basis approach for simulating many-body dynamics [0.0]
The approach constructs an effective Hamiltonian within a reduced subspace, spanned by fermionic Gaussian states, and diagonalizes it to obtain approximate eigenstates and eigenenergies.
Symmetries can be exploited to perform parallel computation, enabling the simulation of systems with much larger sizes.
For quench dynamics we observe that time-evolving wave functions in the truncated subspace facilitates the simulation of long-time dynamics.
arXiv Detail & Related papers (2024-10-05T15:47:01Z)
- Latent Space Energy-based Neural ODEs [73.01344439786524]
This paper introduces a novel family of deep dynamical models designed to represent continuous-time sequence data.
We train the model using maximum likelihood estimation with Markov chain Monte Carlo.
Experiments on oscillating systems, videos and real-world state sequences (MuJoCo) illustrate that ODEs with the learnable energy-based prior outperform existing counterparts.
arXiv Detail & Related papers (2024-09-05T18:14:22Z)
- Stochastic weight matrix dynamics during learning and Dyson Brownian motion [0.0]
We demonstrate that the update of weight matrices in learning algorithms can be described in the framework of Dyson Brownian motion.
We discuss universal and non-universal features in the gas distribution and identify the Wigner surmise and Wigner semicircle explicitly in a teacher-student model.
arXiv Detail & Related papers (2024-07-23T12:25:50Z)
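As a rough illustration of the Wigner surmise referenced above, the following generic GOE toy check (not the teacher-student model of the paper; matrix sizes are assumptions) compares bulk level spacings with P(s) = (pi s / 2) exp(-pi s^2 / 4):

```python
import numpy as np

# Generic GOE toy check of the Wigner surmise (eigenvalue repulsion), not the
# teacher-student setup of the paper: nearest-neighbour spacings in the bulk,
# crudely unfolded by their mean, versus P(s) = (pi s / 2) exp(-pi s^2 / 4).
rng = np.random.default_rng(1)
n, samples = 200, 50
spacings = []
for _ in range(samples):
    A = rng.normal(size=(n, n))
    H = (A + A.T) / np.sqrt(2 * n)           # GOE-like symmetric matrix
    ev = np.linalg.eigvalsh(H)               # sorted eigenvalues
    bulk = np.diff(ev[n // 4: 3 * n // 4])   # spacings away from the edges
    spacings.append(bulk / bulk.mean())      # crude unfolding
spacings = np.concatenate(spacings)

hist, edges = np.histogram(spacings, bins=24, range=(0.0, 3.0), density=True)
centres = 0.5 * (edges[1:] + edges[:-1])
wigner = (np.pi * centres / 2) * np.exp(-np.pi * centres ** 2 / 4)

print(" s   | empirical | Wigner surmise")
for s, h, w in zip(centres[::4], hist[::4], wigner[::4]):
    print(f"{s:4.2f} | {h:9.3f} | {w:14.3f}")
```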
- Quantum trajectory entanglement in various unravelings of Markovian dynamics [0.0]
The cost of classical simulations of quantum many-body dynamics is often determined by the amount of entanglement in the system.
We study entanglement in quantum trajectory approaches that solve master equations describing open quantum system dynamics.
arXiv Detail & Related papers (2024-04-18T13:19:26Z)
- Implicit Regularization of Gradient Flow on One-Layer Softmax Attention [10.060496091806694]
We study gradient flow on the exponential loss for a classification problem with a one-layer softmax attention model.
Under a separability assumption on the data, we show that when gradient flow achieves the minimal loss value, it further implicitly minimizes the nuclear norm of the product of the key and query weight matrices.
arXiv Detail & Related papers (2024-03-13T17:02:27Z)
- Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
arXiv Detail & Related papers (2023-06-06T09:12:49Z)
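For orientation, here is a minimal sketch of plain SGLD with without-replacement (shuffled-epoch) minibatches on a toy problem; all sizes and hyperparameters are illustrative assumptions, and the paper's proposed variation is not reproduced here:

```python
import numpy as np

# Minimal plain-SGLD sketch on a toy least-squares problem with
# without-replacement (shuffled-epoch) minibatches.  Illustrative assumptions
# throughout; this is not the variation proposed in the paper.
rng = np.random.default_rng(2)
n, d = 1024, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
eta, temperature, batch = 0.05, 1e-4, 64

for epoch in range(20):
    perm = rng.permutation(n)                # each epoch: shuffle, no replacement
    for start in range(0, n, batch):
        idx = perm[start:start + batch]
        grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
        # SGLD update: gradient step plus injected Gaussian noise
        theta += -eta * grad + np.sqrt(2 * eta * temperature) * rng.normal(size=d)

print("final mean-squared error:", np.mean((X @ theta - y) ** 2))
```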
- Probabilistic Unrolling: Scalable, Inverse-Free Maximum Likelihood Estimation for Latent Gaussian Models [69.22568644711113]
We introduce probabilistic unrolling, a method that combines Monte Carlo sampling with iterative linear solvers to circumvent matrix inversions.
Our theoretical analyses reveal that unrolling and backpropagation through the iterations of the solver can accelerate gradient estimation for maximum likelihood estimation.
In experiments on simulated and real data, we demonstrate that probabilistic unrolling learns latent Gaussian models up to an order of magnitude faster than gradient EM, with minimal losses in model performance.
arXiv Detail & Related papers (2023-06-05T21:08:34Z)
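The inverse-free ingredient can be illustrated generically (this is not the paper's probabilistic-unrolling implementation; the matrix and vector below are assumptions) by replacing an explicit matrix inverse with a conjugate-gradient solve:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# Generic sketch of the inverse-free ingredient: apply C^{-1} to a vector with
# conjugate gradients, using only matrix-vector products, instead of forming
# an explicit inverse.  Not the paper's probabilistic-unrolling implementation.
rng = np.random.default_rng(3)
d = 500
L = rng.normal(size=(d, d)) / np.sqrt(d)
C = L @ L.T + np.eye(d)                      # symmetric positive-definite matrix
r = rng.normal(size=d)                       # vector appearing in a gradient term

op = LinearOperator((d, d), matvec=lambda v: C @ v)
x, info = cg(op, r, maxiter=200)             # x approximates C^{-1} r

print("CG converged:", info == 0)
print("max residual |C x - r|:", np.abs(C @ x - r).max())
```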
- Capturing dynamical correlations using implicit neural representations [85.66456606776552]
We develop an artificial intelligence framework which combines a neural network trained to mimic simulated data from a model Hamiltonian with automatic differentiation to recover unknown parameters from experimental data.
In doing so, we illustrate the ability to build and train a differentiable model only once, which then can be applied in real-time to multi-dimensional scattering data.
arXiv Detail & Related papers (2023-04-08T07:55:36Z)
- Graph Polynomial Convolution Models for Node Classification of Non-Homophilous Graphs [52.52570805621925]
We investigate efficient learning from higher-order graph convolutions and learning directly from the adjacency matrix for node classification.
We show that the resulting model leads to new graph convolution models with a residual scaling parameter.
We demonstrate that the proposed methods obtain improved accuracy for node classification of non-homophilous graphs.
arXiv Detail & Related papers (2022-09-12T04:46:55Z)
- Fluctuation-dissipation Type Theorem in Stochastic Linear Learning [2.8292841621378844]
The fluctuation-dissipation theorem (FDT) is a simple yet powerful consequence of the first-order differential equation governing the dynamics of systems subject simultaneously to dissipative and stochastic forces.
The linear learning dynamics, in which the input vector maps to the output vector by a linear matrix whose elements are the subject of learning, has a stochastic version closely mimicking the Langevin dynamics when a full-batch gradient descent scheme is replaced by that of stochastic gradient descent.
We derive a generalized FDT for the stochastic linear learning dynamics and verify its validity for well-known machine learning data sets such as MNIST and CIFAR-10.
arXiv Detail & Related papers (2021-06-04T02:54:26Z)
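For context, the classic fluctuation-dissipation relation can be checked numerically for an overdamped harmonic oscillator (a textbook example with assumed parameters, not the generalized FDT derived in the paper):

```python
import numpy as np

# Textbook fluctuation-dissipation check for an overdamped harmonic oscillator
#   dx = -(k / gamma) x dt + sqrt(2 T / gamma) dW,
# whose stationary variance obeys <x^2> = T / k.  This illustrates the classic
# relation only, not the generalized FDT derived in the paper.
rng = np.random.default_rng(4)
k, gamma, T = 2.0, 1.0, 0.5
dt, steps, burn, walkers = 1e-3, 100_000, 20_000, 256

x = np.zeros(walkers)
acc, count = 0.0, 0
for step in range(steps):
    x += -(k / gamma) * x * dt + np.sqrt(2 * T * dt / gamma) * rng.normal(size=walkers)
    if step >= burn:                         # accumulate after equilibration
        acc += np.mean(x ** 2)
        count += 1

print("simulated <x^2>:", acc / count)
print("FDT prediction :", T / k)
```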
- Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of stochasticity in its success is still unclear.
We show that heavy-tailed behaviour commonly arises in the parameters of discrete dynamics with multiplicative noise due to variance in the updates.
A detailed analysis is conducted in which we describe the dependence on key factors, including step size and data, and find similar results on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)