Benign overfitting in ridge regression
- URL: http://arxiv.org/abs/2009.14286v1
- Date: Tue, 29 Sep 2020 20:00:31 GMT
- Title: Benign overfitting in ridge regression
- Authors: A. Tsigler (1) and P. L. Bartlett (1) ((1) UC Berkeley)
- Abstract summary: We provide non-asymptotic generalization bounds for overparametrized ridge regression.
We identify when small or negative regularization is sufficient for obtaining small generalization error.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Classical learning theory suggests that strong regularization is needed to
learn a class with large complexity. This intuition is in contrast with the
modern practice of machine learning, in particular learning neural networks,
where the number of parameters often exceeds the number of data points. It has
been observed empirically that such overparametrized models can show good
generalization performance even if trained with vanishing or negative
regularization. The aim of this work is to understand theoretically how this
effect can occur, by studying the setting of ridge regression. We provide
non-asymptotic generalization bounds for overparametrized ridge regression that
depend on the arbitrary covariance structure of the data, and show that those
bounds are tight for a range of regularization parameter values. To our
knowledge this is the first work that studies overparametrized ridge regression
in such a general setting. We identify when small or negative regularization is
sufficient for obtaining small generalization error. On the technical side, our
bounds only require the data vectors to be i.i.d. sub-gaussian, while most
previous work assumes independence of the components of those vectors.
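As a concrete illustration of the estimator the abstract discusses, here is a minimal numerical sketch of overparametrized ridge regression. The data-generating model (anisotropic Gaussian features with a decaying covariance spectrum), the noise level, and all variable names are illustrative assumptions and not the paper's setup; the sketch only shows that the dual-form ridge estimator stays well defined for vanishing and even mildly negative regularization, the regime the bounds above concern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparametrized setting: more features (p) than samples (n).
n, p = 50, 500

# Assumed toy data model: anisotropic Gaussian features, noisy linear target.
cov_spectrum = 1.0 / np.arange(1, p + 1)              # decaying covariance spectrum
X = rng.standard_normal((n, p)) * np.sqrt(cov_spectrum)
theta_star = rng.standard_normal(p) * np.sqrt(cov_spectrum)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

def ridge(X, y, lam):
    """Ridge estimator in dual form: theta_hat = X^T (X X^T + lam I)^{-1} y.

    In the overparametrized regime this is well defined even for lam <= 0,
    as long as lam exceeds -lambda_min(X X^T)."""
    n = X.shape[0]
    gram = X @ X.T
    return X.T @ np.linalg.solve(gram + lam * np.eye(n), y)

# lam = 0 recovers the minimum-norm interpolator; a small negative lam is also
# admissible here because the Gram matrix X X^T is strictly positive definite.
lam_min = np.linalg.eigvalsh(X @ X.T).min()
for lam in [1.0, 0.0, -0.5 * lam_min]:
    theta_hat = ridge(X, y, lam)
    train_err = np.mean((X @ theta_hat - y) ** 2)
    # Excess risk under the assumed diagonal covariance Sigma = diag(cov_spectrum).
    excess_risk = np.sum(cov_spectrum * (theta_hat - theta_star) ** 2)
    print(f"lam={lam:8.3f}  train MSE={train_err:.2e}  excess risk={excess_risk:.3f}")
```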
Related papers
- The Star Geometry of Critic-Based Regularizer Learning [2.2530496464901106]
Variational regularization is a technique to solve statistical inference tasks and inverse problems.
Recent works learn task-dependent regularizers by integrating information about the measurements and ground-truth data.
There is little theory about the structure of regularizers learned via this process and how it relates to the two data distributions.
arXiv Detail & Related papers (2024-08-29T18:34:59Z)
- Generalization in Kernel Regression Under Realistic Assumptions [41.345620270267446]
We provide rigorous bounds for common kernels and for any amount of regularization, noise, any input dimension, and any number of samples.
Our results imply benign overfitting in high input dimensions, nearly tempered overfitting in fixed dimensions, and explicit convergence rates for regularized regression.
As a by-product, we obtain time-dependent bounds for neural networks trained in the kernel regime.
arXiv Detail & Related papers (2023-12-26T10:55:20Z)
- Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z)
- Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z)
- More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize [94.70343385404203]
We find that most theoretical analyses fall short of capturing qualitative phenomena even for kernel regression.
We prove that the classical GCV estimator converges to the generalization risk whenever a local random matrix law holds (a sketch of the GCV computation for ridge regression appears after this list).
Our findings suggest that random matrix theory may be central to understanding the properties of neural representations in practice.
arXiv Detail & Related papers (2022-03-11T18:59:01Z)
- More is Less: Inducing Sparsity via Overparameterization [2.885175627590247]
In deep learning it is common to overparameterize neural networks, that is, to use more parameters than training samples.
Quite surprisingly, training the neural network via (stochastic) gradient descent leads to models that generalize very well.
Our proof relies on analyzing a certain Bregman divergence of the flow.
arXiv Detail & Related papers (2021-12-21T07:55:55Z)
- Harmless interpolation in regression and classification with structured features [21.064512161584872]
Overparametrized neural networks tend to perfectly fit noisy training data yet generalize well on test data.
We present a general and flexible framework for upper bounding regression and classification risk in a reproducing kernel Hilbert space.
arXiv Detail & Related papers (2021-11-09T15:12:26Z)
- Information-Theoretic Generalization Bounds for Iterative Semi-Supervised Learning [81.1071978288003]
In particular, we seek to understand the behaviour of the generalization error of iterative SSL algorithms using information-theoretic principles.
Our theoretical results suggest that when the class conditional variances are not too large, the upper bound on the generalization error decreases monotonically with the number of iterations, but quickly saturates.
arXiv Detail & Related papers (2021-10-03T05:38:49Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
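The "More Than a Toy" entry above refers to the classical GCV (generalized cross-validation) estimator, which is a concrete, computable quantity. Below is a minimal sketch of the GCV score for ridge regression on synthetic data; the sample sizes, data model, and names are illustrative assumptions, not the cited paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup: n samples, p features, noisy linear target (illustrative only).
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + 0.5 * rng.standard_normal(n)

def gcv(X, y, lam):
    """Classical generalized cross-validation score for ridge regression:
    GCV(lam) = (1/n) ||(I - S_lam) y||^2 / ((1/n) tr(I - S_lam))^2,
    where S_lam = X (X^T X + lam I)^{-1} X^T is the ridge hat matrix."""
    n, p = X.shape
    hat = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - hat @ y
    denom = (1.0 - np.trace(hat) / n) ** 2
    return np.mean(resid ** 2) / denom

# GCV is commonly used as a proxy for out-of-sample risk when choosing lam.
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    print(f"lam={lam:7.2f}  GCV={gcv(X, y, lam):.4f}")
```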
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.