Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data
- URL: http://arxiv.org/abs/2310.18935v1
- Date: Sun, 29 Oct 2023 08:47:48 GMT
- Title: Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data
- Authors: Yiwen Kou and Zixiang Chen and Quanquan Gu
- Abstract summary: The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
- Score: 66.1211659120882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The implicit bias towards solutions with favorable properties is believed to
be a key reason why neural networks trained by gradient-based optimization can
generalize well. While the implicit bias of gradient flow has been widely
studied for homogeneous neural networks (including ReLU and leaky ReLU
networks), the implicit bias of gradient descent is currently only understood
for smooth neural networks. Therefore, implicit bias in non-smooth neural
networks trained by gradient descent remains an open question. In this paper,
we aim to answer this question by studying the implicit bias of gradient
descent for training two-layer fully connected (leaky) ReLU neural networks. We
show that when the training data are nearly-orthogonal, for leaky ReLU
activation function, gradient descent will find a network with a stable rank
that converges to $1$, whereas for ReLU activation function, gradient descent
will find a neural network with a stable rank that is upper bounded by a
constant. Additionally, we show that gradient descent will find a neural
network such that all the training data points have the same normalized margin
asymptotically. Experiments on both synthetic and real data back up our
theoretical findings.
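The abstract states its main results in terms of the stable rank of the trained network. As a minimal sketch, assuming the standard definition of stable rank (squared Frobenius norm divided by squared spectral norm) and a hypothetical helper name `stable_rank` that does not come from the paper, the quantity could be computed for a weight matrix like this:

```python
import numpy as np

def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2 of a matrix (assumed standard definition)."""
    W = np.asarray(W, dtype=float)
    fro_sq = np.sum(W ** 2)            # squared Frobenius norm
    spec = np.linalg.norm(W, ord=2)    # spectral norm (largest singular value)
    return fro_sq / spec ** 2

# Illustrative matrices, not data from the paper.
rank_one = np.outer([1.0, 2.0, 3.0], [4.0, 5.0])  # exactly rank one
identity = np.eye(4)                               # all singular values equal

print(stable_rank(rank_one))
print(stable_rank(identity))
```

A rank-one matrix has stable rank 1, the value the abstract reports as the limit in the leaky ReLU case, while a matrix with equal singular values has stable rank equal to its rank.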
Related papers
- Benign Overfitting for Regression with Trained Two-Layer ReLU Networks [14.36840959836957]
We study the least-square regression problem with a two-layer fully-connected neural network, with ReLU activation function, trained by gradient flow.
Our first result is a generalization result, that requires no assumptions on the underlying regression function or the noise other than that they are bounded.
arXiv Detail & Related papers (2024-10-08T16:54:23Z)
- Benign Overfitting for Two-layer ReLU Convolutional Neural Networks [60.19739010031304]
We establish algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise.
We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk.
arXiv Detail & Related papers (2023-03-07T18:59:38Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
- Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity
on Pruned Neural Networks [79.74580058178594]
We analyze the performance of training a pruned neural network by analyzing the geometric structure of the objective function.
We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned.
arXiv Detail & Related papers (2021-10-12T01:11:07Z)
- A global convergence theory for deep ReLU implicit networks via
over-parameterization [26.19122384935622]
Implicit deep learning has received increasing attention recently.
This paper analyzes the gradient flow of Rectified Linear Unit (ReLU) activated implicit neural networks.
arXiv Detail & Related papers (2021-10-11T23:22:50Z)
- Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent [2.7793394375935088]
We prove that two-layer (Leaky)ReLU networks initialized by, e.g., the widely used method proposed by He et al. are not consistent.
arXiv Detail & Related papers (2020-02-12T09:22:45Z)
- How Implicit Regularization of ReLU Neural Networks Characterizes the
Learned Function -- Part I: the 1-D Case of Two Layers with Random First
Layer [5.969858080492586]
We consider one dimensional (shallow) ReLU neural networks in which weights are chosen randomly and only the terminal layer is trained.
We show that for such networks L2-regularized regression corresponds in function space to regularizing the estimate's second derivative for fairly general loss functionals.
arXiv Detail & Related papers (2019-11-07T13:48:15Z)