The Implicit Bias of Minima Stability in Multivariate Shallow ReLU
Networks
- URL: http://arxiv.org/abs/2306.17499v1
- Date: Fri, 30 Jun 2023 09:17:39 GMT
- Title: The Implicit Bias of Minima Stability in Multivariate Shallow ReLU
Networks
- Authors: Mor Shpigel Nacson, Rotem Mulayoff, Greg Ongie, Tomer Michaeli, Daniel
Soudry
- Abstract summary: We study the type of solutions to which gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss.
We prove that although shallow ReLU networks are universal approximators, stable shallow networks are not.
- Score: 53.95175206863992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the type of solutions to which stochastic gradient descent converges
when used to train a single hidden-layer multivariate ReLU network with the
quadratic loss. Our results are based on a dynamical stability analysis. In the
univariate case, it was shown that linearly stable minima correspond to network
functions (predictors), whose second derivative has a bounded weighted $L^1$
norm. Notably, the bound gets smaller as the step size increases, implying that
training with a large step size leads to `smoother' predictors. Here we
generalize this result to the multivariate case, showing that a similar result
applies to the Laplacian of the predictor. We demonstrate the tightness of our
bound on the MNIST dataset, and show that it accurately captures the behavior
of the solutions as a function of the step size. Additionally, we prove a depth
separation result on the approximation power of ReLU networks corresponding to
stable minima of the loss. Specifically, although shallow ReLU networks are
universal approximators, we prove that stable shallow networks are not. Namely,
there is a function that cannot be well-approximated by stable single
hidden-layer ReLU networks trained with a non-vanishing step size. This is
while the same function can be realized as a stable two hidden-layer ReLU
network. Finally, we prove that if a function is sufficiently smooth (in a
Sobolev sense) then it can be approximated arbitrarily well using shallow ReLU
networks that correspond to stable solutions of gradient descent.
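As a rough, non-authoritative illustration of the stability notion behind the abstract: for full-batch gradient descent with step size $\eta$, a minimum is linearly stable only if the largest eigenvalue of the loss Hessian (the "sharpness") is at most $2/\eta$, so larger step sizes admit only flatter minima. The sketch below is not the paper's code; the toy data, network size, and numerical Hessian are illustrative assumptions, and the check is only meaningful at an actual minimum of the loss.

```python
import numpy as np

# Minimal sketch (assumed setup, not the paper's code): estimate the
# sharpness lambda_max of the quadratic loss of a tiny single-hidden-layer
# ReLU network and compare it against the linear-stability threshold 2/eta.

rng = np.random.default_rng(0)
d, k, n = 2, 3, 20                      # input dim, hidden width, #samples
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]     # arbitrary smooth toy target

def unpack(theta):
    W = theta[:k * d].reshape(k, d)     # first-layer weights
    b = theta[k * d:k * d + k]          # first-layer biases
    a = theta[k * d + k:]               # second-layer weights
    return W, b, a

def loss(theta):
    W, b, a = unpack(theta)
    h = np.maximum(X @ W.T + b, 0.0)    # ReLU hidden layer
    return 0.5 * np.mean((h @ a - y) ** 2)

def hessian(theta, eps=1e-4):
    # Finite-difference Hessian; fine for a 12-parameter toy example.
    p = theta.size
    H = np.zeros((p, p))
    for i in range(p):
        ei = np.zeros(p); ei[i] = eps
        for j in range(p):
            ej = np.zeros(p); ej[j] = eps
            H[i, j] = (loss(theta + ei + ej) - loss(theta + ei - ej)
                       - loss(theta - ei + ej) + loss(theta - ei - ej)) / (4 * eps**2)
    return 0.5 * (H + H.T)

theta = rng.normal(size=k * d + 2 * k) * 0.5   # stand-in for a trained minimum
sharpness = np.linalg.eigvalsh(hessian(theta)).max()

for eta in (0.01, 0.1, 1.0):
    print(f"eta={eta:4.2f}  sharpness={sharpness:.3f}  2/eta={2/eta:6.2f}  "
          f"linearly stable: {sharpness <= 2.0 / eta}")
```

Under this criterion, increasing the step size shrinks the set of minima at which gradient descent can remain, which is the mechanism the abstract ties to `smoother' predictors (a bounded weighted $L^1$ norm of the Laplacian) in the multivariate case.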
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Approximating Positive Homogeneous Functions with Scale Invariant Neural
Networks [28.2446416597989]
We first consider recovery of sparse vectors from few linear measurements.
We then extend our results to a wider class of recovery problems including low-rank matrix recovery and phase retrieval.
Our results shed some light on the seeming contradiction with previous works showing that neural networks for inverse problems typically have very large Lipschitz constants.
arXiv Detail & Related papers (2023-08-05T10:17:04Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Gradient Descent Optimizes Infinite-Depth ReLU Implicit Networks with
Linear Widths [25.237054775800164]
This paper studies the convergence of gradient flow and gradient descent for nonlinear ReLU activated implicit networks.
We prove that both GF and GD converge to a global minimum at a linear rate if the width $m$ of the implicit network is linear in the sample size.
arXiv Detail & Related papers (2022-05-16T06:07:56Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur (see the sketch after this list for how such knot locations can be read off a shallow univariate ReLU network).
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Overparameterization of deep ResNet: zero loss and mean-field analysis [19.45069138853531]
Finding parameters in a deep neural network (NN) that fit data is a nonconvex optimization problem.
We show that a basic first-order optimization method (gradient descent) finds a global solution with perfect fit in many practical situations.
We give estimates of the depth and width needed to reduce the loss below a given threshold, with high probability.
arXiv Detail & Related papers (2021-05-30T02:46:09Z) - On Connectivity of Solutions in Deep Learning: The Role of
Over-parameterization and Feature Quality [21.13299067136635]
We present a novel condition for ensuring the connectivity of two arbitrary points in parameter space.
This condition is provably milder than dropout stability, and it provides a connection between the problem of finding low-loss paths and the memorization capacity of neural nets.
arXiv Detail & Related papers (2021-02-18T23:44:08Z) - Implicit Bias of Gradient Descent for Mean Squared Error Regression with
Two-Layer Wide Neural Networks [1.3706331473063877]
We show that the solution of training a width-$n$ shallow ReLU network is within $n^{-1/2}$ of the function which fits the training data.
We also show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.
arXiv Detail & Related papers (2020-06-12T17:46:40Z)
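As a small illustration related to the linear-regions and piecewise-linear entries above, and not taken from any of the listed papers: for a univariate single-hidden-layer ReLU network $f(x) = \sum_i a_i\,\mathrm{relu}(w_i x + b_i)$, each hidden unit with $w_i \neq 0$ contributes a knot (a kink of the piecewise linear predictor) at $x = -b_i / w_i$, so knot locations can be read directly off the first-layer parameters and compared with the training inputs. All names and numbers below are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (assumed univariate net f(x) = sum_i a_i * relu(w_i*x + b_i);
# not code from any of the listed papers): read knot locations off the
# first-layer parameters and measure how far they fall from the data points.

rng = np.random.default_rng(1)
k = 8                                   # hidden width
w = rng.normal(size=k)                  # first-layer weights
b = rng.normal(size=k)                  # first-layer biases

x_train = np.linspace(-2.0, 2.0, 5)     # toy training inputs

knots = -b[w != 0] / w[w != 0]          # kink locations of the predictor
print("knot locations:   ", np.round(np.sort(knots), 3))
print("training inputs:  ", x_train)

# Nonzero distances indicate knots at locations distinct from the data points,
# the phenomenon mentioned in the mean-field entry above.
dist = np.abs(knots[:, None] - x_train[None, :]).min(axis=1)
print("dist to nearest x:", np.round(dist, 3))
```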