Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks
Trained with the Logistic Loss
- URL: http://arxiv.org/abs/2002.04486v4
- Date: Mon, 22 Jun 2020 15:50:22 GMT
- Title: Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks
Trained with the Logistic Loss
- Authors: Lenaic Chizat (LMO), Francis Bach (LIENS, SIERRA)
- Abstract summary: Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks.
We analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss
with gradient-based methods are observed to perform well in many supervised
classification tasks. Towards understanding this phenomenon, we analyze the
training and generalization behavior of infinitely wide two-layer neural
networks with homogeneous activations. We show that the limits of the gradient
flow on exponentially tailed losses can be fully characterized as a max-margin
classifier in a certain non-Hilbertian space of functions. In the presence of
hidden low-dimensional structures, the resulting margin is independent of the
ambient dimension, which leads to strong generalization bounds. In contrast,
training only the output layer implicitly solves a kernel support vector
machine, which a priori does not enjoy such adaptivity. Our analysis of
training is non-quantitative in terms of running time, but we prove
computational guarantees in simplified settings by showing equivalences with
online mirror descent. Finally, numerical experiments suggest that our analysis
describes well the practical behavior of two-layer neural networks with ReLU
activation and confirm the statistical benefits of this implicit bias.
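As a concrete illustration of the setting described above, the following minimal sketch trains a wide two-layer ReLU network in the mean-field parameterization f(x) = (1/m) * sum_j b_j * relu(w_j . x) on the logistic loss with full-batch gradient descent, and compares training both layers against training only the output layer (the kernel / random-features regime contrasted in the abstract). This is not the authors' code: the toy dataset (labels depending on a single coordinate, loosely echoing the hidden low-dimensional structure mentioned above), the width, the step size, and the iteration count are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a wide two-layer ReLU network in the
# mean-field parameterization, trained on the logistic loss with full-batch
# gradient descent. All hyperparameters below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

# Toy binary data in R^d whose label depends on a single coordinate, loosely
# echoing the "hidden low-dimensional structure" discussed in the abstract.
n, d, m = 200, 20, 1000                    # samples, ambient dimension, hidden width
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] > 0, 1.0, -1.0)       # labels in {-1, +1}
X[:, 0] += 0.3 * y                         # enforce a margin along the informative coordinate


def train(both_layers=True, steps=5000, lr=None):
    """Gradient descent on the logistic loss. With both_layers=False the hidden
    weights stay frozen at initialization, so only the output layer is trained
    (a random-features / kernel regime, the contrast drawn in the abstract)."""
    if lr is None:
        lr = 0.1 * m                       # per-neuron gradients are O(1/m) in this parameterization
    W = rng.normal(size=(m, d)) / np.sqrt(d)   # hidden-layer weights
    b = rng.normal(size=m)                     # output weights
    for _ in range(steps):
        pre = X @ W.T                      # (n, m) pre-activations
        act = np.maximum(pre, 0.0)         # ReLU
        f = act @ b / m                    # network outputs, f(x_i)
        # dLoss/df for the logistic loss, written with tanh for numerical stability
        g = -y * 0.5 * (1.0 - np.tanh(y * f / 2.0)) / n
        if both_layers:
            grad_W = ((g[:, None] * b[None, :] * (pre > 0)).T @ X) / m
            W -= lr * grad_W
        b -= lr * (act.T @ g / m)
    f = np.maximum(X @ W.T, 0.0) @ b / m   # final outputs
    loss = np.mean(np.logaddexp(0.0, -y * f))
    return loss, float(np.mean(np.sign(f) == y))


for both in (True, False):
    loss, acc = train(both_layers=both)
    print(f"both_layers={both}: logistic loss {loss:.4f}, train accuracy {acc:.3f}")
```

The step size is scaled with the width because, in this 1/m parameterization, gradients with respect to individual neurons are of order 1/m; this is how the mean-field gradient flow is typically discretized in practice. The comparison only reproduces the training setup, not the paper's max-margin characterization or its generalization bounds.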
Related papers
- Feature Averaging: An Implicit Bias of Gradient Descent Leading to Non-Robustness in Neural Networks [13.983863226803336]
We argue that "Feature Averaging" is one of the principal factors contributing to the non-robustness of deep neural networks.
We provide a detailed theoretical analysis of the training dynamics of gradient descent in a two-layer ReLU network for a binary classification task.
We prove that, with the provision of more granular supervised information, a two-layer multi-class neural network is capable of learning individual features.
arXiv Detail & Related papers (2024-10-14T09:28:32Z) - Benign Overfitting for Regression with Trained Two-Layer ReLU Networks [14.36840959836957]
We study the least-square regression problem with a two-layer fully-connected neural network, with ReLU activation function, trained by gradient flow.
Our first result is a generalization bound that requires no assumptions on the underlying regression function or the noise, other than that they are bounded.
arXiv Detail & Related papers (2024-10-08T16:54:23Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Convergence Analysis for Learning Orthonormal Deep Linear Neural
Networks [27.29463801531576]
We provide convergence analysis for training orthonormal deep linear neural networks.
Our results shed light on how increasing the number of hidden layers can impact the convergence speed.
arXiv Detail & Related papers (2023-11-24T18:46:54Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Theoretical Characterization of How Neural Network Pruning Affects its
Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z) - On Feature Learning in Neural Networks with Global Convergence
Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z) - Convergence and Implicit Regularization Properties of Gradient Descent
for Deep Residual Networks [7.090165638014331]
We prove linear convergence of gradient descent to a global minimum for the training of deep residual networks with constant layer width and smooth activation function.
We show that the trained weights, as a function of the layer index, admit a scaling limit which is Hölder continuous as the depth of the network tends to infinity.
arXiv Detail & Related papers (2022-04-14T22:50:28Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer
Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - Proxy Convexity: A Unified Framework for the Analysis of Neural Networks
Trained by Gradient Descent [95.94432031144716]
We propose a unified non-convex optimization framework for the analysis of neural network training.
We show that many existing guarantees for networks trained by gradient descent can be unified within this framework.
arXiv Detail & Related papers (2021-06-25T17:45:00Z) - Universal scaling laws in the gradient descent training of neural
networks [10.508187462682308]
We show that the learning trajectory can be characterized by explicit bounds at large training times.
Our results are based on spectral analysis of the evolution of a large network trained on the expected loss.
arXiv Detail & Related papers (2021-05-02T16:46:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.