Benign overfitting in leaky ReLU networks with moderate input dimension
- URL: http://arxiv.org/abs/2403.06903v3
- Date: Wed, 02 Oct 2024 18:52:14 GMT
- Title: Benign overfitting in leaky ReLU networks with moderate input dimension
- Authors: Kedar Karhadkar, Erin George, Michael Murray, Guido Montúfar, Deanna Needell
- Abstract summary: We study benign overfitting in two-layer leaky ReLU networks trained with the hinge loss on a binary classification task.
We characterize conditions on the signal to noise ratio (SNR) of the model parameters giving rise to benign versus non-benign (or harmful) overfitting.
- Score: 23.522101525839687
- Abstract: The problem of benign overfitting asks whether it is possible for a model to perfectly fit noisy training data and still generalize well. We study benign overfitting in two-layer leaky ReLU networks trained with the hinge loss on a binary classification task. We consider input data that can be decomposed into the sum of a common signal and a random noise component, that lie on subspaces orthogonal to one another. We characterize conditions on the signal to noise ratio (SNR) of the model parameters giving rise to benign versus non-benign (or harmful) overfitting: in particular, if the SNR is high then benign overfitting occurs, conversely if the SNR is low then harmful overfitting occurs. We attribute both benign and non-benign overfitting to an approximate margin maximization property and show that leaky ReLU networks trained on hinge loss with gradient descent (GD) satisfy this property. In contrast to prior work we do not require the training data to be nearly orthogonal. Notably, for input dimension $d$ and training sample size $n$, while results in prior work require $d = \Omega(n^2 \log n)$, here we require only $d = \Omega\left(n\right)$.
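For intuition, here is a minimal sketch (not the authors' code; all dimensions and hyperparameters are illustrative assumptions) of the setup the abstract describes: labels $y \in \{-1, +1\}$, inputs formed as a shared signal plus noise confined to an orthogonal subspace, and a two-layer leaky ReLU network trained on the hinge loss with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 200, 64      # samples, input dimension (d = Omega(n)), hidden width
alpha = 0.1                # leaky ReLU slope
lr, steps = 0.05, 2000

# Signal along the first coordinate; noise confined to the orthogonal complement.
mu = np.zeros(d)
mu[0] = 2.0
y = rng.choice([-1.0, 1.0], size=n)
xi = rng.normal(scale=1.0 / np.sqrt(d), size=(n, d))
xi[:, 0] = 0.0                       # keep noise orthogonal to the signal
X = y[:, None] * mu + xi             # x_i = y_i * mu + xi_i

W = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))  # trained first layer
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)     # fixed second layer

def leaky(z):
    return np.where(z > 0, z, alpha * z)

for _ in range(steps):
    Z = X @ W.T                          # (n, m) pre-activations
    margin = y * (leaky(Z) @ a)          # y_i * f(x_i)
    active = (margin < 1).astype(float)  # hinge-loss subgradient mask
    dphi = np.where(Z > 0, 1.0, alpha)   # leaky ReLU derivative
    grad_W = -((active * y)[:, None] * dphi * a[None, :]).T @ X / n
    W -= lr * grad_W

print("train error:", np.mean(np.sign(leaky(X @ W.T) @ a) != y))
```

This only illustrates the data model and training loop, not a verification of the theorem; with a high SNR one would expect the trained interpolator to track the signal direction and hence generalize.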
Related papers
- Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods [43.32546195968771]
We study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activation.
Our results address shortcomings of the well-established Rademacher complexity-based bounds.
We show that a large step-size significantly improves upon the NTK regime's results in classifying the XOR distribution.
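For reference, a common form of the XOR cluster distribution used in such analyses (an illustrative construction; the paper's exact distribution may differ):

```python
import numpy as np

# Hypothetical XOR cluster data: Gaussian clusters at +/-mu1 and +/-mu2,
# labeled by which pair of orthogonal cluster means generated the sample,
# so neither class is linearly separable from the other.
rng = np.random.default_rng(0)
n, d, sigma = 200, 50, 0.5
mu1 = np.zeros(d)
mu1[0] = 1.0                         # first cluster direction
mu2 = np.zeros(d)
mu2[1] = 1.0                         # second, orthogonal direction
signs = rng.choice([-1.0, 1.0], size=n)
use_mu1 = rng.random(n) < 0.5        # which mean pair each sample comes from
means = np.where(use_mu1[:, None], signs[:, None] * mu1, signs[:, None] * mu2)
X = means + sigma * rng.normal(size=(n, d))
y = np.where(use_mu1, 1.0, -1.0)     # label = which pair: the XOR structure
```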
arXiv Detail & Related papers (2024-10-13T21:49:29Z)
- Benign Overfitting in Single-Head Attention [27.297696573634976]
We study benign overfitting in a single-head softmax attention model, which is the fundamental building block of Transformers.
We prove that under appropriate conditions, the model exhibits benign overfitting in a classification setting already after two steps of gradient descent.
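A stripped-down single-head softmax attention scorer of the general kind studied in this line of work (a sketch under an assumed parameterization; the paper's exact model may differ):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_score(X, p, v):
    """X: (T, d) token matrix; p: (d,) attention direction; v: (d,) readout.

    Returns a scalar score; sign(score) is the predicted binary label.
    """
    weights = softmax(X @ p)         # softmax attention over the T tokens
    return float(v @ (weights @ X))  # readout of the attention-weighted average
```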
arXiv Detail & Related papers (2024-10-10T09:23:33Z)
- Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
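The mechanism here rests on the standard linear-stability condition for gradient descent (a textbook fact, stated for context rather than quoted from the paper):

```latex
% A minimum \theta^* can be linearly stable for gradient descent with a
% fixed step size \eta only if the sharpness (top Hessian eigenvalue) obeys
\[
  \lambda_{\max}\!\left(\nabla^2 L(\theta^*)\right) \;\le\; \frac{2}{\eta},
\]
% so larger step sizes confine GD to flatter minima, which for univariate
% ReLU networks correspond to smoother interpolating functions.
```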
arXiv Detail & Related papers (2024-06-10T22:57:27Z)
- Approximating Positive Homogeneous Functions with Scale Invariant Neural Networks [28.2446416597989]
We first consider recovery of sparse vectors from few linear measurements.
We then extend our results to a wider class of recovery problems including low-rank matrix recovery and phase retrieval.
Our results shed some light on the seeming contradiction posed by previous works showing that neural networks for inverse problems typically have very large Lipschitz constants.
arXiv Detail & Related papers (2023-08-05T10:17:04Z)
- Noisy Interpolation Learning with Shallow Univariate ReLU Networks [33.900009202637285]
Mallinar et al. (2022) noted that neural networks often seem to exhibit "tempered overfitting", wherein the population risk does not converge to the Bayes-optimal error.
We provide the first rigorous analysis of the overfitting behavior of regression with minimum-norm weights.
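For reference, the taxonomy of Mallinar et al. (2022), paraphrased: with Bayes risk $R^*$ and $R_n$ the population risk of the interpolating rule fit to $n$ samples,

```latex
\[
  \text{benign: } R_n \to R^*, \qquad
  \text{tempered: } R_n \to c \ \text{with}\ R^* < c < \infty, \qquad
  \text{catastrophic: } R_n \to \infty .
\]
```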
arXiv Detail & Related papers (2023-07-28T08:41:12Z)
- Deep Graph Neural Networks via Posteriori-Sampling-based Node-Adaptive Residual Module [65.81781176362848]
Graph Neural Networks (GNNs) can learn from graph-structured data through neighborhood information aggregation.
As the number of layers increases, node representations become indistinguishable, which is known as over-smoothing.
We propose a Posterior-Sampling-based Node-Adaptive Residual module (PSNR).
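A generic residual aggregation step of the kind used against over-smoothing (an illustrative pattern only; PSNR's node-adaptive, posterior-sampled residual weights are more involved):

```python
import numpy as np

def residual_gnn_layer(A_hat, H, W, alpha=0.5):
    """One message-passing layer with a residual connection.

    A_hat: (N, N) normalized adjacency; H: (N, f) node features;
    W: (f, f) layer weights; alpha: residual strength (here a fixed scalar,
    whereas a node-adaptive module would set it per node).
    """
    aggregated = np.maximum(A_hat @ H @ W, 0.0)   # ReLU(A_hat H W)
    return aggregated + alpha * H                 # residual keeps nodes distinct
```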
arXiv Detail & Related papers (2023-05-09T12:03:42Z)
- Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization [59.038366742773164]
Linear classifiers and leaky ReLU networks trained by gradient flow on the logistic loss have an implicit bias towards satisfying the Karush-Kuhn-Tucker (KKT) conditions for margin maximization.
In this work we establish a number of settings where the satisfaction of these conditions implies benign overfitting in linear classifiers and in two-layer leaky ReLU networks.
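The margin-maximization problem and its KKT conditions, in the standard form used in this line of work (for nonsmooth networks the gradients below are Clarke subgradients):

```latex
\[
  \min_{\theta}\ \tfrac{1}{2}\lVert\theta\rVert^2
  \quad \text{subject to} \quad y_i f(\theta; x_i) \ge 1 \quad \forall i,
\]
% KKT conditions: stationarity, dual feasibility, complementary slackness
\[
  \theta = \sum_{i=1}^{n} \lambda_i\, y_i \nabla_\theta f(\theta; x_i),
  \qquad \lambda_i \ge 0,
  \qquad \lambda_i \bigl( y_i f(\theta; x_i) - 1 \bigr) = 0 .
\]
```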
arXiv Detail & Related papers (2023-03-02T18:24:26Z)
- Optimal Online Generalized Linear Regression with Stochastic Noise and Its Application to Heteroscedastic Bandits [88.6139446295537]
We study the problem of online generalized linear regression in the setting of a generalized linear model with possibly unbounded additive noise.
We provide a sharp analysis of the classical follow-the-regularized-leader (FTRL) algorithm to cope with the label noise.
We propose an algorithm based on FTRL to achieve the first variance-aware regret bound.
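A minimal FTRL sketch for online linear regression with squared loss (a generic illustration of the algorithm family; the paper's variance-aware analysis and generalized-linear link go beyond this):

```python
import numpy as np

class FTRLRegressor:
    """theta_{t+1} = argmin_theta sum_{s<=t} (x_s @ theta - y_s)^2 + lam*||theta||^2.

    For squared losses this has the ridge-style closed form maintained below.
    """

    def __init__(self, d, lam=1.0):
        self.A = lam * np.eye(d)     # regularized Gram matrix lam*I + sum x x^T
        self.b = np.zeros(d)         # sum of y * x

    def predict(self, x):
        theta = np.linalg.solve(self.A, self.b)
        return float(theta @ x)

    def update(self, x, y):
        self.A += np.outer(x, x)
        self.b += y * x
```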
arXiv Detail & Related papers (2022-02-28T08:25:26Z)
- ReLU Regression with Massart Noise [52.10842036932169]
We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data.
We focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model.
We develop an efficient algorithm that achieves exact parameter recovery in this model.
arXiv Detail & Related papers (2021-09-10T02:13:22Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- The Generalized Lasso with Nonlinear Observations and Generative Priors [63.541900026673055]
We make the assumption of sub-Gaussian measurements, which is satisfied by a wide range of measurement models.
We show that our result can be extended to the uniform recovery guarantee under the assumption of a so-called local embedding property.
arXiv Detail & Related papers (2020-06-22T16:43:35Z)