Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data
- URL: http://arxiv.org/abs/2210.07082v1
- Date: Thu, 13 Oct 2022 15:09:54 GMT
- Title: Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data
- Authors: Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro, Wei Hu
- Abstract summary: In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that leakyally, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
- Score: 63.34506218832164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The implicit biases of gradient-based optimization algorithms are conjectured
to be a major factor in the success of modern deep learning. In this work, we
investigate the implicit bias of gradient flow and gradient descent in
two-layer fully-connected neural networks with leaky ReLU activations when the
training data are nearly-orthogonal, a common property of high-dimensional
data. For gradient flow, we leverage recent work on the implicit bias for
homogeneous neural networks to show that asymptotically, gradient flow produces
a neural network with rank at most two. Moreover, this network is an
$\ell_2$-max-margin solution (in parameter space), and has a linear decision
boundary that corresponds to an approximate-max-margin linear predictor. For
gradient descent, provided the random initialization variance is small enough,
we show that a single step of gradient descent suffices to drastically reduce
the rank of the network, and that the rank remains small throughout training.
We provide experiments which suggest that a small initialization scale is
important for finding low-rank neural networks with gradient descent.
Related papers
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over- parameterization, where the width is $tildemathcalO(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Continuous vs. Discrete Optimization of Deep Neural Networks [15.508460240818575]
We show that over deep neural networks with homogeneous activations, gradient flow trajectories enjoy favorable curvature.
This finding allows us to translate an analysis of gradient flow over deep linear neural networks into a guarantee that gradient descent efficiently converges to global minimum.
We hypothesize that the theory of gradient flows will be central to unraveling mysteries behind deep learning.
arXiv Detail & Related papers (2021-07-14T10:59:57Z) - Vanishing Curvature and the Power of Adaptive Methods in Randomly
Initialized Deep Networks [30.467121747150816]
This paper revisits the so-called vanishing gradient phenomenon, which commonly occurs in deep randomly neural networks.
We first show that vanishing gradients cannot be circumvented when the network width scales with less than O(depth)
arXiv Detail & Related papers (2021-06-07T16:29:59Z) - Convergence rates for gradient descent in the training of
overparameterized artificial neural networks with biases [3.198144010381572]
In recent years, artificial neural networks have developed into a powerful tool for dealing with a multitude of problems for which classical solution approaches.
It is still unclear why randomly gradient descent algorithms reach their limits.
arXiv Detail & Related papers (2021-02-23T18:17:47Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and or binary weights the training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z) - Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent [2.7793394375935088]
We prove that two-layer (Leaky)ReLU networks by e.g. the widely used method proposed by He et al. are not consistent.
arXiv Detail & Related papers (2020-02-12T09:22:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.