Directional Convergence, Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks
- URL: http://arxiv.org/abs/2505.16204v2
- Date: Mon, 06 Oct 2025 02:21:14 GMT
- Title: Directional Convergence, Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks
- Authors: Ichiro Hashimoto
- Abstract summary: We study benign overfitting of leaky ReLU two-layer neural network classifiers trained on mixture data via gradient descent. We provide both upper and lower classification error bounds and discover a phase transition in the bound as a function of signal strength.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we study benign overfitting of fixed-width leaky ReLU two-layer neural network classifiers trained on mixture data via gradient descent. We provide both upper and lower classification error bounds, and discover a phase transition in the bound as a function of signal strength. The lower bound leads to a characterization of cases when benign overfitting provably fails even if directional convergence occurs. Our analysis allows us to considerably relax the distributional assumptions that are made in existing work on benign overfitting of leaky ReLU two-layer neural network classifiers. We can allow for non-sub-Gaussian data and do not require near orthogonality. Our results are derived by establishing directional convergence of the network parameters and studying classification error bounds for the convergent direction. Previously, directional convergence in (leaky) ReLU neural networks was established only for gradient flow. By first establishing directional convergence, we are able to study benign overfitting of fixed-width leaky ReLU two-layer neural network classifiers in a much wider range of scenarios than was done before.
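To make the setting in the abstract concrete, here is a minimal NumPy sketch of a fixed-width two-layer leaky ReLU classifier trained by full-batch gradient descent on two-component mixture data, tracking the normalized weight direction over time. It is an illustration only, not the paper's construction: the Gaussian noise, the logistic loss, the fixed second layer `a`, the initialization scale, and every name and default value (`run_sketch`, `gamma`, `signal`, `every`, and so on) are assumptions chosen for the example.

```python
import numpy as np


def leaky_relu(z, gamma):
    """Leaky ReLU with negative-side slope gamma (0 < gamma < 1)."""
    return np.where(z > 0, z, gamma * z)


def leaky_relu_grad(z, gamma):
    """Derivative of leaky ReLU (the value at z == 0 is taken to be gamma)."""
    return np.where(z > 0, 1.0, gamma)


def run_sketch(n=20, d=200, m=8, gamma=0.1, signal=3.0,
               lr=0.05, steps=2000, every=200, seed=0):
    rng = np.random.default_rng(seed)

    # Assumed data model: x = y * mu + standard Gaussian noise.
    mu = np.zeros(d)
    mu[0] = signal
    y = rng.choice([-1.0, 1.0], size=n)
    X = y[:, None] * mu + rng.standard_normal((n, d))

    # Assumed architecture: f(x) = sum_j a_j * leaky_relu(w_j . x),
    # with the second layer fixed at +-1/sqrt(m) and only W trained.
    a = np.where(np.arange(m) % 2 == 0, 1.0, -1.0) / np.sqrt(m)
    W = 0.01 * rng.standard_normal((m, d))

    checkpoints = []
    for t in range(1, steps + 1):
        Z = X @ W.T                          # pre-activations, shape (n, m)
        f = leaky_relu(Z, gamma) @ a         # network outputs, shape (n,)
        margins = y * f
        # Logistic loss log(1 + exp(-u)); its derivative in u is
        # -1 / (1 + exp(u)), written via tanh for numerical stability.
        dloss = -0.5 * (1.0 - np.tanh(margins / 2.0))
        coef = (dloss * y)[:, None] * leaky_relu_grad(Z, gamma) * a[None, :]
        grad_W = coef.T @ X / n              # gradient of the mean loss, (m, d)
        W -= lr * grad_W
        if t % every == 0:
            # Store the direction W / ||W||_F at regular intervals.
            checkpoints.append(W / np.linalg.norm(W))

    # Cosine similarity between consecutive stored directions.
    cosines = [float(np.sum(p * q))
               for p, q in zip(checkpoints, checkpoints[1:])]
    preds = np.sign(leaky_relu(X @ W.T, gamma) @ a)
    return {
        "direction": W / np.linalg.norm(W),
        "cosines": cosines,
        "train_acc": float(np.mean(preds == y)),
    }
```

Calling `run_sketch()` returns the final normalized weight matrix, the cosine similarities between consecutive checkpoints, and the training accuracy. Cosines approaching 1 are what one would expect to observe if the direction `W / ||W||` stabilizes; the sketch says nothing about test error, which is what the paper's bounds address.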
Related papers
- Benign Overfitting for Regression with Trained Two-Layer ReLU Networks [14.36840959836957]
We study the least-square regression problem with a two-layer fully-connected neural network, with ReLU activation function, trained by gradient flow.
Our first result is a generalization result, that requires no assumptions on the underlying regression function or the noise other than that they are bounded.
arXiv Detail & Related papers (2024-10-08T16:54:23Z)
- On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a fixed step size is used. We provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z)
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z)
- Approximation Results for Gradient Descent trained Neural Networks [0.0]
The networks are fully connected constant depth increasing width.
The continuous kernel error norm implies an approximation under the natural smoothness assumption required for smooth functions.
arXiv Detail & Related papers (2023-09-09T18:47:55Z)
- Benign Overfitting for Two-layer ReLU Convolutional Neural Networks [60.19739010031304]
We establish algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise.
We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk.
arXiv Detail & Related papers (2023-03-07T18:59:38Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild overparameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z)
- Improved Overparametrization Bounds for Global Convergence of Stochastic Gradient Descent for Shallow Neural Networks [1.14219428942199]
We study the overparametrization bounds required for the global convergence of the gradient descent algorithm for a class of one-hidden-layer feed-forward neural networks.
arXiv Detail & Related papers (2022-01-28T11:30:06Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
- The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- Directional Convergence Analysis under Spherically Symmetric Distribution [21.145823611499104]
We consider the fundamental problem of learning linear predictors (i.e., separable datasets with zero margin) using neural networks with gradient flow or gradient descent.
We show directional convergence guarantees with exact convergence rate for two-layer non-linear networks with only two hidden nodes, and (deep) linear networks.
arXiv Detail & Related papers (2021-05-09T08:59:58Z)
- When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations? [51.1848572349154]
We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero.
Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU.
arXiv Detail & Related papers (2021-02-09T18:04:37Z)
- How Implicit Regularization of ReLU Neural Networks Characterizes the Learned Function -- Part I: the 1-D Case of Two Layers with Random First Layer [5.969858080492586]
We consider one dimensional (shallow) ReLU neural networks in which weights are chosen randomly and only the terminal layer is trained.
We show that for such networks L2-regularized regression corresponds in function space to regularizing the estimate's second derivative for fairly general loss functionals.
arXiv Detail & Related papers (2019-11-07T13:48:15Z)