Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks
Trained with the Logistic Loss
- URL: http://arxiv.org/abs/2002.04486v4
- Date: Mon, 22 Jun 2020 15:50:22 GMT
- Title: Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks
Trained with the Logistic Loss
- Authors: Lenaic Chizat (LMO), Francis Bach (LIENS, SIERRA)
- Abstract summary: Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks.
We analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss
with gradient-based methods are observed to perform well in many supervised
classification tasks. Towards understanding this phenomenon, we analyze the
training and generalization behavior of infinitely wide two-layer neural
networks with homogeneous activations. We show that the limits of the gradient
flow on exponentially tailed losses can be fully characterized as a max-margin
classifier in a certain non-Hilbertian space of functions. In the presence of
hidden low-dimensional structures, the resulting margin is independent of the
ambient dimension, which leads to strong generalization bounds. In contrast,
training only the output layer implicitly solves a kernel support vector
machine, which a priori does not enjoy such adaptivity. Our analysis of
training is non-quantitative in terms of running time, but we prove
computational guarantees in simplified settings by showing equivalences with
online mirror descent. Finally, numerical experiments suggest that our analysis
describes well the practical behavior of two-layer neural networks with ReLU
activation and confirm the statistical benefits of this implicit bias.
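As a concrete illustration of the setting described above, the following minimal sketch trains a wide two-layer ReLU network in the mean-field parameterization f(x) = (1/m) * sum_j b_j * relu(w_j . x) on the logistic loss with full-batch gradient descent, and compares training both layers against training only the output layer (the kernel / random-features regime contrasted in the abstract). This is not the authors' code: the toy dataset (labels depending on a single coordinate, loosely echoing the hidden low-dimensional structure mentioned above), the width, the step size, and the iteration count are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a wide two-layer ReLU network in the
# mean-field parameterization, trained on the logistic loss with full-batch
# gradient descent. All hyperparameters below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

# Toy binary data in R^d whose label depends on a single coordinate, loosely
# echoing the "hidden low-dimensional structure" discussed in the abstract.
n, d, m = 200, 20, 1000                    # samples, ambient dimension, hidden width
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] > 0, 1.0, -1.0)       # labels in {-1, +1}
X[:, 0] += 0.3 * y                         # enforce a margin along the informative coordinate


def train(both_layers=True, steps=5000, lr=None):
    """Gradient descent on the logistic loss. With both_layers=False the hidden
    weights stay frozen at initialization, so only the output layer is trained
    (a random-features / kernel regime, the contrast drawn in the abstract)."""
    if lr is None:
        lr = 0.1 * m                       # per-neuron gradients are O(1/m) in this parameterization
    W = rng.normal(size=(m, d)) / np.sqrt(d)   # hidden-layer weights
    b = rng.normal(size=m)                     # output weights
    for _ in range(steps):
        pre = X @ W.T                      # (n, m) pre-activations
        act = np.maximum(pre, 0.0)         # ReLU
        f = act @ b / m                    # network outputs, f(x_i)
        # dLoss/df for the logistic loss, written with tanh for numerical stability
        g = -y * 0.5 * (1.0 - np.tanh(y * f / 2.0)) / n
        if both_layers:
            grad_W = ((g[:, None] * b[None, :] * (pre > 0)).T @ X) / m
            W -= lr * grad_W
        b -= lr * (act.T @ g / m)
    f = np.maximum(X @ W.T, 0.0) @ b / m   # final outputs
    loss = np.mean(np.logaddexp(0.0, -y * f))
    return loss, float(np.mean(np.sign(f) == y))


for both in (True, False):
    loss, acc = train(both_layers=both)
    print(f"both_layers={both}: logistic loss {loss:.4f}, train accuracy {acc:.3f}")
```

The step size is scaled with the width because, in this 1/m parameterization, gradients with respect to individual neurons are of order 1/m; this is how the mean-field gradient flow is typically discretized in practice. The comparison only reproduces the training setup, not the paper's max-margin characterization or its generalization bounds.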
Related papers
- Feature Averaging: An Implicit Bias of Gradient Descent Leading to Non-Robustness in Neural Networks [13.983863226803336]
We argue that "Feature Averaging" is one of the principal factors contributing to the non-robustness of deep neural networks.
We provide a detailed theoretical analysis of the training dynamics of gradient descent in a two-layer ReLU network for a binary classification task.
We prove that, with the provision of more granular supervised information, a two-layer multi-class neural network is capable of learning individual features.
arXiv Detail & Related papers (2024-10-14T09:28:32Z) - Benign Overfitting for Regression with Trained Two-Layer ReLU Networks [14.36840959836957]
We study the least-square regression problem with a two-layer fully-connected neural network, with ReLU activation function, trained by gradient flow.
Our first result is a generalization bound that requires no assumptions on the underlying regression function or the noise, other than that they are bounded.
arXiv Detail & Related papers (2024-10-08T16:54:23Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Convergence Analysis for Learning Orthonormal Deep Linear Neural
Networks [27.29463801531576]
We provide convergence analysis for training orthonormal deep linear neural networks.
Our results shed light on how increasing the number of hidden layers can impact the convergence speed.
arXiv Detail & Related papers (2023-11-24T18:46:54Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Theoretical Characterization of How Neural Network Pruning Affects its
Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z) - On Feature Learning in Neural Networks with Global Convergence
Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z) - Convergence and Implicit Regularization Properties of Gradient Descent
for Deep Residual Networks [7.090165638014331]
We prove linear convergence of gradient descent to a global minimum for the training of deep residual networks with constant layer width and smooth activation function.
We show that the trained weights, as a function of the layer index, admit a scaling limit which is Hölder continuous as the depth of the network tends to infinity.
arXiv Detail & Related papers (2022-04-14T22:50:28Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer
Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - Proxy Convexity: A Unified Framework for the Analysis of Neural Networks
Trained by Gradient Descent [95.94432031144716]
We propose a unified non-convex optimization framework for the analysis of neural network training.
We show that many existing guarantees for networks trained by gradient descent can be unified within this framework.
arXiv Detail & Related papers (2021-06-25T17:45:00Z) - Universal scaling laws in the gradient descent training of neural
networks [10.508187462682308]
We show that the learning trajectory can be characterized by explicit bounds at large training times.
Our results are based on spectral analysis of the evolution of a large network trained on the expected loss.
arXiv Detail & Related papers (2021-05-02T16:46:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.