Machine learning in and out of equilibrium
- URL: http://arxiv.org/abs/2306.03521v1
- Date: Tue, 6 Jun 2023 09:12:49 GMT
- Title: Machine learning in and out of equilibrium
- Authors: Shishir Adhikari, Alkan Kabakçıoğlu, Alexander Strang, Deniz Yuret, Michael Hinczewski
- Abstract summary: Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
- Score: 58.88325379746631
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The algorithms used to train neural networks, like stochastic gradient
descent (SGD), have close parallels to natural processes that navigate a
high-dimensional parameter space -- for example protein folding or evolution.
Our study uses a Fokker-Planck approach, adapted from statistical physics, to
explore these parallels in a single, unified framework. We focus in particular
on the stationary state of the system in the long-time limit, which in
conventional SGD is out of equilibrium, exhibiting persistent currents in the
space of network parameters. As in its physical analogues, the current is
associated with an entropy production rate for any given training trajectory.
The stationary distribution of these rates obeys the integral and detailed
fluctuation theorems -- nonequilibrium generalizations of the second law of
thermodynamics. We validate these relations in two numerical examples, a
nonlinear regression network and MNIST digit classification. While the
fluctuation theorems are universal, there are other aspects of the stationary
state that are highly sensitive to the training details. Surprisingly, the
effective loss landscape and diffusion matrix that determine the shape of the
stationary distribution vary depending on the simple choice of minibatching
done with or without replacement. We can take advantage of this nonequilibrium
sensitivity to engineer an equilibrium stationary state for a particular
application: sampling from a posterior distribution of network weights in
Bayesian machine learning. We propose a new variation of stochastic gradient
Langevin dynamics (SGLD) that harnesses without-replacement minibatching. In an
example system where the posterior is exactly known, this SGWORLD algorithm
outperforms SGLD, converging to the posterior orders of magnitude faster as a
function of the learning rate.
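For reference, the integral and detailed fluctuation theorems that the abstract says the entropy production obeys take their standard stochastic-thermodynamics form. The notation below (Δs_tot for the total entropy production along a trajectory, P for its distribution in the stationary state) is illustrative and not taken from the paper:

```latex
% Standard forms of the fluctuation theorems (notation is illustrative):
% integral fluctuation theorem
\left\langle e^{-\Delta s_{\mathrm{tot}}} \right\rangle = 1,
% detailed fluctuation theorem
\frac{P(\Delta s_{\mathrm{tot}} = A)}{P(\Delta s_{\mathrm{tot}} = -A)} = e^{A},
% and, via Jensen's inequality, the second-law-like statement
\langle \Delta s_{\mathrm{tot}} \rangle \ge 0 .
```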
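A minimal sketch of the minibatching distinction the abstract highlights: plain SGLD run on a toy Bayesian linear-regression problem whose Gaussian posterior is known exactly, switching between minibatching with replacement and epoch-wise minibatching without replacement. This is not the paper's SGWORLD algorithm; the function names, toy data, and hyperparameters below are illustrative assumptions only.

```python
# Hedged sketch (not the authors' SGWORLD implementation): SGLD on a toy
# conjugate-Gaussian posterior, comparing minibatching WITH replacement to
# epoch-wise minibatching WITHOUT replacement.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = w_true * x + noise, Gaussian prior on w -> Gaussian posterior
N, w_true, sigma = 200, 1.5, 0.5
x = rng.normal(size=N)
y = w_true * x + sigma * rng.normal(size=N)
prior_var = 1.0

# Exact posterior for reference (conjugate Gaussian case)
post_prec = 1.0 / prior_var + (x @ x) / sigma**2
post_mean = (x @ y) / sigma**2 / post_prec

def grad_log_post(w, idx):
    """Minibatch estimate of the gradient of the log posterior."""
    xb, yb = x[idx], y[idx]
    grad_lik = (N / len(idx)) * (xb @ (yb - w * xb)) / sigma**2
    grad_prior = -w / prior_var
    return grad_lik + grad_prior

def run_sgld(eta, batch, steps, replace=True):
    """SGLD update: w <- w + (eta/2) * grad_log_post + sqrt(eta) * noise."""
    w, samples = 0.0, []
    order, ptr = rng.permutation(N), 0
    for _ in range(steps):
        if replace:
            idx = rng.integers(0, N, size=batch)      # sample with replacement
        else:
            if ptr + batch > N:                       # reshuffle each epoch
                order, ptr = rng.permutation(N), 0
            idx = order[ptr:ptr + batch]
            ptr += batch
        w += 0.5 * eta * grad_log_post(w, idx) + np.sqrt(eta) * rng.normal()
        samples.append(w)
    return np.array(samples[steps // 2:])             # discard burn-in

for replace in (True, False):
    s = run_sgld(eta=1e-3, batch=20, steps=20000, replace=replace)
    print(f"replace={replace}: mean={s.mean():.3f}, var={s.var():.4f} "
          f"(exact mean={post_mean:.3f}, var={1/post_prec:.4f})")
```

Comparing the two sample means and variances against the exact posterior gives a quick sense of how the choice of minibatching scheme changes the stationary distribution that the sampler actually reaches.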
Related papers
- Law of Balance and Stationary Distribution of Stochastic Gradient Descent [11.937085301750288]
We prove that the minibatch noise of stochastic gradient descent (SGD) regularizes the solution towards a balanced solution whenever the loss function contains a rescaling symmetry.
We then derive the stationary distribution of stochastic gradient flow for a diagonal linear network with arbitrary depth and width.
These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.
arXiv Detail & Related papers (2023-08-13T03:13:03Z)
- Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z)
- Learning Neural Constitutive Laws From Motion Observations for Generalizable PDE Dynamics [97.38308257547186]
Many NN approaches learn an end-to-end model that implicitly models both the governing PDE and material models.
We argue that the governing PDEs are often well-known and should be explicitly enforced rather than learned.
We introduce a new framework termed "Neural Constitutive Laws" (NCLaw) which utilizes a network architecture that strictly guarantees standard priors.
arXiv Detail & Related papers (2023-04-27T17:42:24Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Evolutionary Echo State Network: evolving reservoirs in the Fourier space [1.7658686315825685]
The Echo State Network (ESN) is a class of Recurrent Neural Networks with a large number of hidden-hidden weights (in the so-called reservoir).
We propose a new computational model of the ESN type, that represents the reservoir weights in the Fourier space and performs a fine-tuning of these weights applying genetic algorithms in the frequency domain.
arXiv Detail & Related papers (2022-06-10T08:59:40Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
- Variational Inference for Continuous-Time Switching Dynamical Systems [29.984955043675157]
We present a model based on a Markov jump process modulating a subordinated diffusion process.
We develop a new continuous-time variational inference algorithm.
We extensively evaluate our algorithm under the model assumption and for real-world examples.
arXiv Detail & Related papers (2021-09-29T15:19:51Z)
- The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion [29.489737359897312]
We study the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD).
We show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space.
arXiv Detail & Related papers (2021-07-19T20:18:57Z)
- The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a heavy-tailed stationary distribution.
We translate our results into insights about the behavior of SGD in deep learning.
arXiv Detail & Related papers (2020-06-08T16:43:56Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)