When does gradient descent with logistic loss find interpolating
two-layer networks?
- URL: http://arxiv.org/abs/2012.02409v2
- Date: Thu, 14 Jan 2021 04:52:04 GMT
- Title: When does gradient descent with logistic loss find interpolating
two-layer networks?
- Authors: Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett
- Abstract summary: We show that gradient descent drives the training loss to zero if the initial loss is small enough.
When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
- Score: 51.1848572349154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the training of finite-width two-layer smoothed ReLU networks for
binary classification using the logistic loss. We show that gradient descent
drives the training loss to zero if the initial loss is small enough. When the
data satisfies certain cluster and separation conditions and the network is
wide enough, we show that one step of gradient descent reduces the loss
sufficiently that the first result applies.
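To make the setting concrete, here is a minimal sketch of the objects in the abstract: a finite-width two-layer network with a smoothed ReLU activation, trained for binary classification on the logistic loss by full-batch gradient descent. The Huberized ReLU form, the fixed ±1/√m output layer, the toy data, and all hyperparameters below are illustrative assumptions, not the paper's exact construction or constants.

```python
import numpy as np

def huberized_relu(z, h=1.0):
    # Zero for z <= 0, quadratic on (0, h], linear with slope 1 beyond h.
    return np.where(z <= 0, 0.0, np.where(z <= h, z ** 2 / (2 * h), z - h / 2))

def huberized_relu_grad(z, h=1.0):
    return np.where(z <= 0, 0.0, np.where(z <= h, z / h, 1.0))

def forward(W, a, X, h=1.0):
    # W: (m, d) hidden-layer weights, a: (m,) fixed output weights, X: (n, d) inputs.
    return huberized_relu(X @ W.T, h) @ a

def logistic_loss(W, a, X, y, h=1.0):
    # Labels y in {-1, +1}; mean of log(1 + exp(-y * f(x))).
    return np.mean(np.logaddexp(0.0, -y * forward(W, a, X, h)))

def grad_W(W, a, X, y, h=1.0):
    pre = X @ W.T                                     # (n, m) pre-activations
    margins = y * (huberized_relu(pre, h) @ a)
    coeff = -y * np.exp(-np.logaddexp(0.0, margins))  # -y * sigmoid(-y f(x)), numerically stable
    # Chain rule: dL/dw_j = mean_i coeff_i * a_j * sigma'(w_j . x_i) * x_i
    G = coeff[:, None] * huberized_relu_grad(pre, h) * a[None, :]
    return G.T @ X / X.shape[0]

rng = np.random.default_rng(0)
n, d, m = 200, 10, 512
X = rng.normal(size=(n, d)) / np.sqrt(d)            # toy inputs with roughly unit norm
y = np.where(X[:, 0] >= 0, 1.0, -1.0)               # a separable toy labeling
W = rng.normal(size=(m, d)) / np.sqrt(d)            # random first-layer initialization
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)    # fixed second layer

lr, steps = 1.0, 500
print(f"initial training loss: {logistic_loss(W, a, X, y):.4f}")
for _ in range(steps):
    W -= lr * grad_W(W, a, X, y)
print(f"training loss after {steps} steps: {logistic_loss(W, a, X, y):.4f}")
```

On this separable toy, the loss should drop noticeably over the run; the paper's first result concerns the regime where the loss is already small, and its second shows that, under cluster and separation conditions and sufficient width, a single gradient step can reach that regime.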
Related papers
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with a quadratic loss function, a fully connected feedforward architecture, ReLU activations, Gaussian data instances, and adversarial labels.
Our results strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the "NTK regime".
arXiv Detail & Related papers (2022-12-05T14:47:52Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Convergence of gradient descent for learning linear neural networks [2.209921757303168]
We show that gradient descent converges to a critical point of the loss function (the square loss in this article).
In the case of three or more layers, we show that gradient descent converges to a global minimum on the manifold of matrices of some fixed rank.
arXiv Detail & Related papers (2021-08-04T13:10:30Z) - When does gradient descent with logistic loss interpolate using deep
networks with smoothed ReLU activations? [51.1848572349154]
We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero.
Our analysis applies to smoothed approximations of the ReLU, such as Swish and the Huberized ReLU (both are sketched in code after this list).
arXiv Detail & Related papers (2021-02-09T18:04:37Z) - Implicit bias of deep linear networks in the large learning rate phase [15.846533303963229]
We characterize the implicit bias effect of deep linear networks for binary classification using the logistic loss in a large learning rate regime.
We claim that, depending on the separation conditions of the data, the gradient descent iterates converge to a flatter minimum in the catapult phase.
arXiv Detail & Related papers (2020-11-25T06:50:30Z) - Regularizing Semi-supervised Graph Convolutional Networks with a
Manifold Smoothness Loss [12.948899990826426]
We propose an unsupervised manifold smoothness loss defined with respect to the graph structure, which can be added to the loss function as a regularizer (a minimal sketch of such a regularizer appears after this list).
We conduct experiments on multi-layer perceptrons and existing graph networks, and demonstrate that adding the proposed loss consistently improves performance.
arXiv Detail & Related papers (2020-02-11T08:51:53Z)
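For the entry above on deep networks with smoothed ReLU activations, here is a short sketch of the two smoothed approximations it names, Swish and the Huberized ReLU; the parameter names (beta and h) and their default values are illustrative choices, not the paper's.

```python
import numpy as np

def swish(z, beta=1.0):
    # Swish: z * sigmoid(beta * z); smooth everywhere and close to ReLU for large beta.
    return z / (1.0 + np.exp(-beta * z))

def huberized_relu(z, h=1.0):
    # Zero for z <= 0, quadratic on (0, h], linear with slope 1 beyond h,
    # so the derivative is continuous (unlike the ReLU's kink at 0).
    return np.where(z <= 0, 0.0, np.where(z <= h, z ** 2 / (2 * h), z - h / 2))

z = np.linspace(-2.0, 2.0, 9)
print(np.round(swish(z), 3))
print(np.round(huberized_relu(z), 3))
```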
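For the entry above on the manifold smoothness loss, here is a minimal sketch of a graph-based smoothness regularizer in that spirit: it penalizes differences between a model's outputs on nodes joined by an edge. The graph-Laplacian quadratic form and the weight lambda_smooth are assumptions for illustration, not the paper's exact loss.

```python
import numpy as np

def manifold_smoothness_loss(outputs, adjacency):
    # outputs: (n, k) model outputs for n nodes; adjacency: (n, n) symmetric edge weights.
    # Equals the sum over edges of w_ij * ||f(x_i) - f(x_j)||^2, via the graph Laplacian.
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    return np.trace(outputs.T @ laplacian @ outputs)

# Toy usage: 4 nodes on a path graph, 2-dimensional model outputs.
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
F = np.array([[0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [1.0, 0.0]])

supervised_loss = 0.0   # placeholder for the usual supervised term
lambda_smooth = 0.1     # assumed regularization weight
total_loss = supervised_loss + lambda_smooth * manifold_smoothness_loss(F, A)
print(total_loss)
```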
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.