Implicit Bias of MSE Gradient Optimization in Underparameterized Neural
Networks
- URL: http://arxiv.org/abs/2201.04738v1
- Date: Wed, 12 Jan 2022 23:28:41 GMT
- Title: Implicit Bias of MSE Gradient Optimization in Underparameterized Neural
Networks
- Authors: Benjamin Bowman and Guido Montufar
- Abstract summary: We study the dynamics of a neural network in function space when optimizing the mean squared error via gradient flow.
We show that the network learns eigenfunctions of an integral operator $T_{K^\infty}$ determined by the Neural Tangent Kernel (NTK) at rates corresponding to their eigenvalues.
We conclude that damped deviations offer a simple and unifying perspective on the dynamics when optimizing the squared error.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the dynamics of a neural network in function space when optimizing
the mean squared error via gradient flow. We show that in the
underparameterized regime the network learns eigenfunctions of an integral
operator $T_{K^\infty}$ determined by the Neural Tangent Kernel (NTK) at rates
corresponding to their eigenvalues. For example, for uniformly distributed data
on the sphere $S^{d - 1}$ and rotation invariant weight distributions, the
eigenfunctions of $T_{K^\infty}$ are the spherical harmonics. Our results can
be understood as describing a spectral bias in the underparameterized regime.
The proofs use the concept of "Damped Deviations", where deviations of the NTK
matter less for eigendirections with large eigenvalues due to the occurrence of
a damping factor. Aside from the underparameterized regime, the damped-deviations
point of view can be used to track the dynamics of the empirical risk in the
overparameterized setting, allowing us to extend certain results in the
literature. We conclude that damped deviations offer a simple and unifying
perspective on the dynamics when optimizing the squared error.
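As a rough editorial illustration of this spectral bias (a toy sketch, not code from the paper): under gradient flow with a fixed positive semi-definite kernel, the residual decays independently along each eigendirection at a rate set by the corresponding eigenvalue. The Gaussian kernel, the noisy target, and the zero-initialized predictor below are illustrative assumptions standing in for the NTK setting.

```python
# Toy sketch: kernel gradient flow on the MSE fits eigendirections of the
# kernel matrix at rates given by their eigenvalues (spectral bias).
# Assumptions: fixed Gaussian kernel as a stand-in for the NTK K^infty,
# predictor initialized at zero so the initial residual equals the targets.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.sort(rng.uniform(-1.0, 1.0, size=n))           # training inputs
y = np.sin(3 * np.pi * X) + 0.3 * rng.normal(size=n)  # noisy targets

K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / 0.1 ** 2)
lam, V = np.linalg.eigh(K)                            # ascending eigenvalues

coeff0 = V.T @ y                                      # residual in eigenbasis
for t in [0.0, 1.0, 10.0, 100.0]:
    coeff_t = np.exp(-lam * t / n) * coeff0           # per-eigenvalue damping
    top = np.linalg.norm(coeff_t[-10:])               # 10 largest eigenvalues
    bot = np.linalg.norm(coeff_t[:10])                # 10 smallest eigenvalues
    print(f"t={t:6.1f}  residual norm  top-10: {top:.4f}  bottom-10: {bot:.4f}")
```

The printed norms shrink quickly along the top eigendirections and barely move along the bottom ones, which is the claimed eigenvalue-ordered learning in miniature.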
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
- Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching; a generic SGLD step is sketched below.
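For orientation, here is a minimal generic SGLD loop with without-replacement minibatching, on a toy Gaussian-mean problem; the model, step size, and schedule are illustrative assumptions, not the paper's specific variant.

```python
# Generic SGLD sketch: gradient step on a minibatch estimate of the negative
# log-likelihood plus Gaussian noise of variance 2*eta. Minibatches are drawn
# without replacement by reshuffling the data once per epoch.
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=1000)  # toy dataset
theta = 0.0                                       # unknown Gaussian mean
eta, batch_size, n = 1e-4, 50, len(data)

for epoch in range(20):
    perm = rng.permutation(n)                     # one shuffle per epoch
    for start in range(0, n, batch_size):
        batch = data[perm[start:start + batch_size]]
        # Unbiased estimate of the full-data gradient sum_i (theta - x_i).
        grad = (n / batch_size) * np.sum(theta - batch)
        noise = rng.normal(scale=np.sqrt(2.0 * eta))
        theta = theta - eta * grad + noise        # Langevin update

print(f"final theta ~ {theta:.3f} (true mean 2.0)")
```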
arXiv Detail & Related papers (2023-06-06T09:12:49Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Symmetries in the dynamics of wide two-layer neural networks [0.0]
We consider the idealized setting of gradient flow on the population risk for infinitely wide two-layer ReLU neural networks (without bias).
We first describe a general class of symmetries which, when satisfied by the target function $f^*$ and the input distribution, are preserved by the dynamics.
arXiv Detail & Related papers (2022-11-16T08:59:26Z)
- Single Trajectory Nonparametric Learning of Nonlinear Dynamics [8.438421942654292]
Given a single trajectory of a dynamical system, we analyze the performance of the nonparametric least squares estimator (LSE).
We leverage recently developed information-theoretic methods to establish the optimality of the LSE for nonparametric hypotheses classes.
We specialize our results to a number of scenarios of practical interest, such as Lipschitz dynamics, generalized linear models, and dynamics described by functions in certain classes of Reproducing Kernel Hilbert Spaces (RKHS); a toy kernel ridge sketch follows.
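A minimal sketch of such a nonparametric least squares fit, under assumptions of my own (a toy one-dimensional system, a Gaussian kernel, a fixed ridge parameter): kernel ridge regression on consecutive states from a single trajectory, not the paper's exact estimator or guarantees.

```python
# Toy sketch: learn the transition map x_{t+1} = f(x_t) of a nonlinear system
# from one noisy trajectory by regularized least squares in an RKHS
# (kernel ridge regression on consecutive state pairs).
import numpy as np

rng = np.random.default_rng(2)
T = 300
f = lambda s: 0.8 * np.sin(2.0 * s)            # unknown dynamics (toy choice)
x = np.empty(T)
x[0] = 0.1
for t in range(T - 1):
    x[t + 1] = f(x[t]) + 0.05 * rng.normal()   # single noisy trajectory

X, Y = x[:-1], x[1:]                           # consecutive state pairs

def k(a, b, bw=0.3):                           # Gaussian (RBF) kernel
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / bw ** 2)

ridge = 1e-2                                   # regularization strength
alpha = np.linalg.solve(k(X, X) + ridge * np.eye(T - 1), Y)

grid = np.linspace(-1.0, 1.0, 5)
f_hat = k(grid, X) @ alpha                     # estimated transition map
print(np.c_[grid, f_hat, f(grid)])             # estimate vs. ground truth
```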
arXiv Detail & Related papers (2022-02-16T19:38:54Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically; a toy numerical illustration follows.
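The interpolation-threshold peak is easy to reproduce in miniature. The sketch below is an editorial toy: it uses the closed-form minimum-norm least squares solution rather than SGD, and the Gaussian data and ReLU random features are illustrative assumptions.

```python
# Toy double descent for random features (RF) regression: test error of the
# minimum-norm least squares fit typically spikes near p = n (the
# interpolation threshold) and falls again as p grows.
import numpy as np

rng = np.random.default_rng(3)
d, n, n_test = 20, 100, 2000
w_true = rng.normal(size=d) / np.sqrt(d)
Xtr = rng.normal(size=(n, d))
Xte = rng.normal(size=(n_test, d))
ytr = Xtr @ w_true + 0.1 * rng.normal(size=n)   # noisy training labels
yte = Xte @ w_true                              # clean test targets

for p in [10, 50, 90, 100, 110, 200, 1000]:     # number of random features
    W = rng.normal(size=(d, p)) / np.sqrt(d)    # frozen random weights
    Ftr = np.maximum(Xtr @ W, 0.0)              # ReLU random features
    Fte = np.maximum(Xte @ W, 0.0)
    theta = np.linalg.pinv(Ftr) @ ytr           # minimum-norm least squares
    err = np.mean((Fte @ theta - yte) ** 2)
    print(f"p={p:5d}  test MSE: {err:.3f}")
```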
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- Efficient Semi-Implicit Variational Inference [65.07058307271329]
We propose an efficient and scalable semi-implicit variational inference (SIVI) method.
Our method maps SIVI's evidence lower bound (ELBO) to a form that admits rigorous inference with lower-variance gradient estimates.
arXiv Detail & Related papers (2021-01-15T11:39:09Z)
- Inductive Bias of Gradient Descent for Exponentially Weight Normalized Smooth Homogeneous Neural Nets [1.7259824817932292]
We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets when trained on exponential or cross-entropy loss.
This paper shows that the gradient flow path with exponential weight normalization (EWN) is equivalent to gradient flow on standard networks with an adaptive learning rate.
arXiv Detail & Related papers (2020-10-24T14:34:56Z)
- The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a heavy-tailed stationary distribution.
We translate our results into insights about the behavior of SGD in deep learning.
arXiv Detail & Related papers (2020-06-08T16:43:56Z)
- Neural Control Variates [71.42768823631918]
We show that a set of neural networks can address the challenge of finding a good approximation of the integrand.
We derive a theoretically optimal, variance-minimizing loss function, and propose an alternative, composite loss for stable online training in practice.
Specifically, we show that the learned light-field approximation is of sufficient quality for high-order bounces, allowing us to omit the error correction and thereby dramatically reduce the noise at the cost of negligible visible bias.
arXiv Detail & Related papers (2020-06-02T11:17:55Z)
- Solving high-dimensional eigenvalue problems using deep neural networks: A diffusion Monte Carlo like approach [14.558626910178127]
The eigenvalue problem is reformulated as a fixed point problem of the semigroup flow induced by the operator.
The method shares a similar spirit with diffusion Monte Carlo but augments a direct approximation to the eigenfunction through a neural-network ansatz.
Our approach is able to provide accurate eigenvalue and eigenfunction approximations in several numerical examples; a toy matrix analogue is sketched below.
arXiv Detail & Related papers (2020-02-07T03:08:31Z)
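As a toy matrix analogue of that fixed-point reformulation (the paper works in high dimension with a neural-network ansatz; here, plain iteration of the semigroup operator exp(-dt*H) on a small discretized Hamiltonian, purely illustrative):

```python
# Toy sketch: the ground-state eigenpair of H is the fixed point of the
# normalized semigroup flow v -> exp(-dt*H) v / ||exp(-dt*H) v||.
# Here H is a finite-difference 1-D harmonic oscillator (exact E0 = 0.5).
import numpy as np
from scipy.linalg import expm

m = 200
xg = np.linspace(-5.0, 5.0, m)
h = xg[1] - xg[0]                              # grid spacing
lap = (np.eye(m, k=1) - 2.0 * np.eye(m) + np.eye(m, k=-1)) / h ** 2
H = -0.5 * lap + np.diag(0.5 * xg ** 2)        # kinetic + potential terms

P = expm(-0.05 * H)                            # semigroup operator exp(-dt*H)
v = np.random.default_rng(4).normal(size=m)
for _ in range(2000):                          # iterate the flow
    v = P @ v
    v /= np.linalg.norm(v)                     # normalize each step

E0 = v @ H @ v                                 # Rayleigh quotient
print(f"estimated ground-state energy: {E0:.4f} (exact: 0.5)")
```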
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.