Weight fluctuations in (deep) linear neural networks and a derivation of the inverse-variance flatness relation
- URL: http://arxiv.org/abs/2311.14120v4
- Date: Sun, 28 Jul 2024 18:28:55 GMT
- Title: Weight fluctuations in (deep) linear neural networks and a derivation of the inverse-variance flatness relation
- Authors: Markus Gross, Arne P. Raulf, Christoph Räth
- Abstract summary: We investigate the stationary (late-time) training regime of single- and two-layer underparameterized linear neural networks.
We identify the inter-layer coupling as a distinct source of anisotropy for the weight fluctuations.
We provide an analytical derivation of the recently observed inverse variance-flatness relation in a model of a deep linear neural network.
- Score: 6.122833099916154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the stationary (late-time) training regime of single- and two-layer underparameterized linear neural networks within the continuum limit of stochastic gradient descent (SGD) for synthetic Gaussian data. In the case of a single-layer network in the weakly underparameterized regime, the spectrum of the noise covariance matrix deviates notably from the Hessian, which can be attributed to the broken detailed balance of SGD dynamics. The weight fluctuations are in this case generally anisotropic, but effectively experience an isotropic loss. For an underparameterized two-layer network, we describe the stochastic dynamics of the weights in each layer and analyze the associated stationary covariances. We identify the inter-layer coupling as a distinct source of anisotropy for the weight fluctuations. In contrast to the single-layer case, the weight fluctuations are effectively subject to an anisotropic loss, the flatness of which is inversely related to the fluctuation variance. We thereby provide an analytical derivation of the recently observed inverse variance-flatness relation in a model of a deep linear neural network.
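For orientation, the late-time regime described in the abstract is commonly modeled as follows (a sketch of the standard linearized continuum limit of SGD with generic notation $H$, $\Sigma$, $C$, $\eta$, $B$, not necessarily the paper's exact equations): near a minimum $w^*$, the weight fluctuations follow an Ornstein-Uhlenbeck process whose stationary covariance balances the Hessian against the gradient-noise covariance.
```latex
% Hedged sketch of the standard linearized continuum limit of SGD;
% H = Hessian at the minimum, \Sigma = gradient-noise covariance,
% C = stationary weight covariance, \eta = learning rate, B = batch size.
\begin{aligned}
  \mathrm{d}\,\delta w_t &= -H\,\delta w_t \,\mathrm{d}t
      + \Bigl(\tfrac{\eta}{B}\,\Sigma\Bigr)^{1/2}\mathrm{d}W_t,
      \qquad \delta w_t = w_t - w^*,\\
  \text{stationarity:}\quad H C + C H &= \tfrac{\eta}{B}\,\Sigma
      \qquad \text{(Lyapunov equation).}
\end{aligned}
```
In this picture, isotropic noise ($\Sigma \propto \mathbb{1}$) yields the Gibbs-like $C \propto H^{-1}$, whereas $\Sigma \propto H$ yields isotropic fluctuations; deviations of $\Sigma$ from $H$, as found in the weakly underparameterized single-layer case, therefore directly shape the anisotropy of the weight fluctuations.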
Related papers
- Law of Balance and Stationary Distribution of Stochastic Gradient Descent [11.937085301750288]
We prove that the minibatch noise of stochastic gradient descent (SGD) regularizes the solution towards a balanced one whenever the loss function contains a rescaling symmetry (illustrated below).
We then derive the stationary distribution of stochastic gradient flow for a diagonal linear network with arbitrary depth and width.
These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.
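A minimal illustration of the rescaling symmetry referred to above, assuming a two-layer diagonal linear model (the paper's setting is more general):
```latex
% Toy two-layer diagonal linear model (assumed for illustration):
% scaling one layer up and the other down leaves the loss unchanged.
L(u, w) = \mathbb{E}_{(x,y)}\!\left[\bigl((u \odot w)^{\mathsf T} x - y\bigr)^{2}\right],
\qquad
L(\lambda u,\, w/\lambda) = L(u, w) \quad \text{for all } \lambda \neq 0 .
```
In this toy model, "balanced" refers to solutions with $|u_i| = |w_i|$, the representatives singled out within each symmetry orbit.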
arXiv Detail & Related papers (2023-08-13T03:13:03Z)
- Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore the parallels between neural network training and physical systems out of equilibrium.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching (a sketch is given below).
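A minimal sketch of SGLD with without-replacement minibatching (assumptions: linear least-squares toy loss, fixed step size and temperature; a generic illustration, not the paper's specific variant):
```python
import numpy as np

def sgld_without_replacement(X, y, n_epochs=50, batch_size=32,
                             step_size=1e-3, temperature=1e-3, seed=0):
    """Stochastic gradient Langevin dynamics on a linear least-squares toy loss.

    Each epoch shuffles the data once and walks through disjoint minibatches
    (sampling *without* replacement), adding Gaussian noise at every update.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        order = rng.permutation(n)                      # without replacement
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
            noise = rng.normal(size=d) * np.sqrt(2.0 * step_size * temperature)
            w = w - step_size * grad + noise
    return w

# Usage on synthetic Gaussian data (the same kind of setup as the main paper).
rng = np.random.default_rng(1)
X = rng.normal(size=(512, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=512)
w_hat = sgld_without_replacement(X, y)
```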
arXiv Detail & Related papers (2023-06-06T09:12:49Z)
- Stochastic Modified Equations and Dynamics of Dropout Algorithm [4.811269936680572]
Dropout is a widely utilized regularization technique in the training of neural networks.
Its underlying mechanism and its impact on achieving good generalization remain poorly understood.
arXiv Detail & Related papers (2023-05-25T08:42:25Z)
- Capturing dynamical correlations using implicit neural representations [85.66456606776552]
We develop an artificial intelligence framework which combines a neural network trained to mimic simulated data from a model Hamiltonian with automatic differentiation to recover unknown parameters from experimental data.
In doing so, we illustrate the ability to build and train a differentiable model only once, which can then be applied in real time to multi-dimensional scattering data.
arXiv Detail & Related papers (2023-04-08T07:55:36Z)
- Learning Discretized Neural Networks under Ricci Flow [51.36292559262042]
We study Discretized Neural Networks (DNNs) composed of low-precision weights and activations.
During training, DNNs suffer from either infinite or zero gradients, because the quantization function that produces the discrete values is non-differentiable (see the note below).
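The gradient pathology mentioned above can be seen directly from any hard quantizer (a generic observation, not specific to the paper's Ricci-flow treatment): a piecewise-constant map has zero derivative almost everywhere and a distributionally infinite derivative at its jumps.
```latex
% Generic illustration with a sign quantizer q (not the paper's construction):
% the derivative vanishes away from the jump and is a Dirac delta at it.
q(x) = \operatorname{sign}(x), \qquad
\frac{\mathrm{d}q}{\mathrm{d}x}(x) = 0 \ \ \text{for } x \neq 0, \qquad
\frac{\mathrm{d}q}{\mathrm{d}x} = 2\,\delta(x) \ \ \text{in the distributional sense},
```
so naive backpropagation through $q$ either kills the gradient or blows it up.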
arXiv Detail & Related papers (2023-02-07T10:51:53Z)
- Phenomenology of Double Descent in Finite-Width Neural Networks [29.119232922018732]
Double descent describes how the generalization behaviour of a model changes between the under- and over-parameterized regimes.
We use influence functions to derive suitable expressions of the population loss and its lower bound.
Building on our analysis, we investigate how the loss function affects double descent.
arXiv Detail & Related papers (2022-03-14T17:39:49Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots (kinks of the learned piecewise linear map) can occur at locations distinct from the data points.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD); a minimal sketch of this setup is given below.
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
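A minimal sketch of the random-features regression setup summarized above (assumptions: ReLU random features, squared loss, constant-step-size SGD on synthetic Gaussian data; a generic illustration, not the paper's exact model):
```python
import numpy as np

def relu_features(X, W):
    """Assumed feature map: ReLU random features phi(x) = max(W x, 0)."""
    return np.maximum(X @ W.T, 0.0)

def rf_regression_sgd(X, y, n_features=200, n_epochs=100,
                      step_size=1e-3, seed=0):
    """Fit a linear readout on frozen random features with constant-step-size SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(n_features, d)) / np.sqrt(d)   # frozen random layer
    theta = np.zeros(n_features)                        # trainable readout
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            phi = relu_features(X[i:i + 1], W)[0]
            grad = 2.0 * (phi @ theta - y[i]) * phi
            theta -= step_size * grad
    return W, theta

# Synthetic Gaussian data; the double descent peak in test error typically
# appears when sweeping the ratio n_features / n through 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=300)
W, theta = rf_regression_sgd(X, y)
train_mse = np.mean((relu_features(X, W) @ theta - y) ** 2)
```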
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z)
- Equivariant Wavelets: Fast Rotation and Translation Invariant Wavelet Scattering Transforms [0.0]
Imposing symmetry on image statistics can improve human interpretability, aid in generalization, and provide dimension reduction.
We introduce a fast-to-compute, translationally invariant and rotationally equivariant wavelet scattering network.
arXiv Detail & Related papers (2021-04-22T18:00:01Z)
- The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a heavy-tailed stationary distribution.
We translate our results into insights about the behavior of SGD in deep learning.
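A sketch of the classical mechanism behind such heavy tails, shown here for SGD on a one-dimensional quadratic loss (a generic Kesten-type argument with assumed notation, not the paper's full multivariate statement):
```latex
% One-dimensional sketch (assumed notation): SGD on a quadratic loss is a
% stochastic recursion with a random multiplicative factor; fluctuations in
% that factor generate power-law (heavy) tails in the stationary law.
w_{k+1} = (1 - \eta h_k)\, w_k + \eta\, q_k ,
\qquad
\mathbb{P}(|w_\infty| > t) \sim c\, t^{-\alpha},
\quad \text{with } \alpha \text{ solving } \mathbb{E}\!\left[|1 - \eta h_k|^{\alpha}\right] = 1 ,
```
where $h_k$ and $q_k$ denote the minibatch curvature and gradient noise; in such models the tails typically become heavier as the step size grows or the batch size shrinks.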
arXiv Detail & Related papers (2020-06-08T16:43:56Z)