Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias
- URL: http://arxiv.org/abs/2110.13905v1
- Date: Tue, 26 Oct 2021 17:57:57 GMT
- Title: Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias
- Authors: Kaifeng Lyu, Zhiyuan Li, Runzhe Wang, Sanjeev Arora
- Abstract summary: Real-life neural networks are initialized from small random values and trained with cross-entropy loss for classification.
Recent results provide theoretical evidence that gradient descent may converge to the "max-margin" solution with zero loss, which presumably generalizes well.
The current paper establishes this global optimality for two-layer Leaky ReLU nets trained with gradient flow on linearly separable and symmetric data, regardless of the width.
- Score: 34.81794649454105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The generalization mystery of overparametrized deep nets has motivated
efforts to understand how gradient descent (GD) converges to low-loss solutions
that generalize well. Real-life neural networks are initialized from small
random values and trained with cross-entropy loss for classification (unlike
the "lazy" or "NTK" regime of training where analysis was more successful), and
a recent sequence of results (Lyu and Li, 2020; Chizat and Bach, 2020; Ji and
Telgarsky, 2020) provide theoretical evidence that GD may converge to the
"max-margin" solution with zero loss, which presumably generalizes well.
However, the global optimality of margin is proved only in some settings where
neural nets are infinitely or exponentially wide. The current paper is able to
establish this global optimality for two-layer Leaky ReLU nets trained with
gradient flow on linearly separable and symmetric data, regardless of the
width. The analysis also gives some theoretical justification for recent
empirical findings (Kalimeris et al., 2019) on the so-called simplicity bias of
GD towards linear or other "simple" classes of solutions, especially early in
training. On the pessimistic side, the paper suggests that such results are
fragile. A simple data manipulation can make gradient flow converge to a linear
classifier with suboptimal margin.
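For concreteness, the statement can be read against the following minimal sketch of the setting; the notation is illustrative, and the precise assumptions (data symmetry, initialization scale, leaky slope) are those stated in the paper.

```latex
% Minimal sketch of the setting (illustrative notation, not the paper's exact one).
% Two-layer Leaky ReLU network with m hidden neurons:
\[
  f(\theta; x) = \sum_{j=1}^{m} a_j\, \phi(w_j^\top x),
  \qquad \phi(z) = \max\{z, \alpha z\}, \quad 0 < \alpha < 1 .
\]
% For training data (x_i, y_i) with y_i \in \{\pm 1\}, the normalized margin is
\[
  \gamma(\theta) = \frac{\min_{i}\, y_i f(\theta; x_i)}{\|\theta\|_2^{2}},
\]
% where the exponent 2 is the order of homogeneity of f in \theta = (a, W).
% "Global optimality of margin" means that gradient flow on the cross-entropy loss
% drives \theta / \|\theta\|_2 toward a direction maximizing \gamma, for any width m.
```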
Related papers
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with quadratic loss function, fully connected feedforward architecture, ReLU activations, Gaussian data instances, and adversarial labels.
They strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the NTK regime.
arXiv Detail & Related papers (2022-12-05T14:47:52Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but it struggles with small models.
We introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Implicit Regularization Towards Rank Minimization in ReLU Networks [34.41953136999683]
We study the conjectured relationship between the implicit regularization in neural networks and rank minimization.
We focus on nonlinear ReLU networks, providing several new positive and negative results.
arXiv Detail & Related papers (2022-01-30T09:15:44Z) - A global convergence theory for deep ReLU implicit networks via
over-parameterization [26.19122384935622]
Implicit deep learning has received increasing attention recently.
This paper analyzes the gradient flow of Rectified Linear Unit (ReLU) activated implicit neural networks.
arXiv Detail & Related papers (2021-10-11T23:22:50Z) - Distribution of Classification Margins: Are All Data Equal? [61.16681488656473]
We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization.
The resulting subset of "high capacity" features is not consistent across different training runs.
arXiv Detail & Related papers (2021-07-21T16:41:57Z) - Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent [2.7793394375935088]
We prove that two-layer (Leaky)ReLU networks initialized by, e.g., the widely used method proposed by He et al. are not consistent.
arXiv Detail & Related papers (2020-02-12T09:22:45Z) - The Implicit Bias of Gradient Descent on Separable Data [44.98410310356165]
We show the predictor converges to the direction of the max-margin (hard margin SVM) solution.
This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero.
arXiv Detail & Related papers (2017-10-27T21:47:58Z)
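The last entry above (the implicit bias of gradient descent on separable data) is easy to check numerically. The sketch below is an illustration only, not code from any of the papers; the dataset, step size, and the use of scikit-learn's LinearSVC with a large C as a proxy for the hard-margin solution are choices made here. It runs plain gradient descent on the logistic loss and compares the resulting direction with the SVM direction.

```python
# Illustrative sketch (not from the papers above): on linearly separable data,
# gradient descent on the unregularized logistic loss converges in *direction*
# to the max-margin (hard-margin SVM) separator, cf. Soudry et al. (2017).
import numpy as np
from scipy.special import expit          # numerically stable sigmoid
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs in 2-D (linearly separable with high probability).
n = 200
X = np.vstack([rng.normal([2.0, 2.0], 0.5, size=(n, 2)),
               rng.normal([-2.0, -2.0], 0.5, size=(n, 2))])
y = np.concatenate([np.ones(n), -np.ones(n)])

# Plain gradient descent on the logistic loss (no regularization, no bias term).
w = np.zeros(2)
lr = 0.1
for _ in range(200_000):
    margins = y * (X @ w)
    grad = -(X.T @ (y * expit(-margins))) / len(y)
    w -= lr * grad

# Hard-margin SVM direction, approximated by a linear SVM with a very large C.
svm = LinearSVC(C=1e6, loss="hinge", fit_intercept=False, max_iter=100_000)
svm.fit(X, y)
w_svm = svm.coef_.ravel()

cos = w @ w_svm / (np.linalg.norm(w) * np.linalg.norm(w_svm))
print(f"cosine similarity between GD and hard-margin SVM directions: {cos:.4f}")
# Expected to approach 1.0 as training continues; the loss keeps shrinking while
# ||w|| grows, which is why optimizing past zero training error still helps.
```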