Related papers: When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?

URL: http://arxiv.org/abs/2102.04998v1
Date: Tue, 9 Feb 2021 18:04:37 GMT
Title: When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?
Authors: Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett
Abstract summary: We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero. Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU.
Score: 51.1848572349154
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence. Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU, proposed in previous applied work. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at initialization. The second is a data separation condition used in prior analyses.
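As a rough illustration of the objects named in the abstract, the sketch below writes out the two smoothed ReLU activations and the logistic loss in plain Python. The parameterizations are assumptions, not taken from this page: Swish is written as z * sigmoid(z), the Huberized ReLU uses a hypothetical smoothing width `h`, and `margin` stands for the label times the network output.

```python
import math


def swish(z: float) -> float:
    """Swish activation, z * sigmoid(z), computed without overflow."""
    if z >= 0.0:
        return z / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z < 0, so exp(z) < 1
    return z * ez / (1.0 + ez)


def huberized_relu(z: float, h: float = 1.0) -> float:
    """Huberized ReLU (assumed form): quadratic on [0, h], linear above h."""
    if z < 0.0:
        return 0.0
    if z <= h:
        return z * z / (2.0 * h)
    return z - h / 2.0


def logistic_loss(margin: float) -> float:
    """Logistic loss log(1 + exp(-margin)), with margin = y * f(x)."""
    if margin >= 0.0:
        return math.log1p(math.exp(-margin))
    # For negative margins, factor out exp(-margin) to avoid overflow.
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    for z in (-2.0, 0.0, 0.5, 2.0):
        print(z, swish(z), huberized_relu(z))
    # The loss shrinks toward zero as the margin grows.
    print([logistic_loss(m) for m in (0.0, 5.0, 50.0)])
```

Both activations are differentiable everywhere, unlike the plain ReLU, and the logistic loss is positive for every finite margin, so driving it to zero means the margins on the training data grow without bound.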

Related papers

Directional Convergence, Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks [0.0]
We prove directional convergence of network parameters of fixed-width leaky ReLU two-layer neural networks optimized by gradient descent with exponential loss. As an application, we demonstrate that benign overfitting occurs with high probability in sub-Gaussian mixture models.
arXiv Detail & Related papers (2025-05-22T04:11:58Z)
A Local Polyak-Lojasiewicz and Descent Lemma of Gradient Descent For Overparametrized Linear Models [6.734175048463699]
We derive a linear convergence rate for gradient descent for two-layer linear neural networks trained with squared loss. Our convergence analysis not only improves upon prior results but also suggests a better choice for the step size.
arXiv Detail & Related papers (2025-05-16T19:57:22Z)
Benign Overfitting for Regression with Trained Two-Layer ReLU Networks [14.36840959836957]
We study the least-squares regression problem with a two-layer fully-connected neural network, with ReLU activation function, trained by gradient flow. Our first result is a generalization result that requires no assumptions on the underlying regression function or the noise other than that they are bounded.
arXiv Detail & Related papers (2024-10-08T16:54:23Z)
Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well. While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z)
The Implicit Bias of Minima Stability in Multivariate Shallow ReLU Networks [53.95175206863992]
We study the type of solutions to which gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss. We prove that although shallow ReLU networks are universal approximators, stable shallow networks are not.
arXiv Detail & Related papers (2023-06-30T09:17:39Z)
Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
Gradient descent provably escapes saddle points in the training of shallow ReLU networks [6.458742319938318]
We prove a variant of the relevant dynamical systems result, a center-stable manifold theorem, in which we relax some of the regularity requirements. Building on a detailed examination of critical points of the square integral loss function for shallow ReLU and leaky ReLU networks, we show that gradient descent escapes most saddle points.
arXiv Detail & Related papers (2022-08-03T14:08:52Z)
Improved Overparametrization Bounds for Global Convergence of Stochastic Gradient Descent for Shallow Neural Networks [1.14219428942199]
We study the overparametrization bounds required for the global convergence of the stochastic gradient descent algorithm for a class of one-hidden-layer feed-forward neural networks.
arXiv Detail & Related papers (2022-01-28T11:30:06Z)
Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD). We show that SGD is biased towards a simple solution. We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
When does gradient descent with logistic loss find interpolating two-layer networks? [51.1848572349154]
We show that gradient descent drives the training loss to zero if the initial loss is small enough. When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
arXiv Detail & Related papers (2020-12-04T05:16:51Z)
Asymptotic convergence rate of Dropout on shallow linear neural networks [0.0]
We analyze the convergence on objective functions induced by Dropout and Dropconnect, when applying them to shallow linear neural networks. We obtain a local convergence proof of the gradient flow and a bound on the rate that depends on the data, the dropout probability, and the width of the network.
arXiv Detail & Related papers (2020-12-01T19:02:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.