Implicit Regularization Leads to Benign Overfitting for Sparse Linear
Regression
- URL: http://arxiv.org/abs/2302.00257v2
- Date: Fri, 26 May 2023 03:27:21 GMT
- Title: Implicit Regularization Leads to Benign Overfitting for Sparse Linear
Regression
- Authors: Mo Zhou, Rong Ge
- Abstract summary: In deep learning, often the training process finds an interpolator (a solution with 0 training loss) but the test loss is still low.
One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties for the interpolator.
We show that training our new model via gradient descent leads to an interpolator with near-optimal test loss.
- Score: 16.551664358490658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In deep learning, often the training process finds an interpolator (a
solution with 0 training loss), but the test loss is still low. This
phenomenon, known as benign overfitting, is a major mystery that has received a lot
of recent attention. One common mechanism for benign overfitting is implicit
regularization, where the training process leads to additional properties for
the interpolator, often characterized by minimizing certain norms. However,
even for a simple sparse linear regression problem $y = \beta^{*\top} x +\xi$
with sparse $\beta^*$, neither the minimum $\ell_1$ nor the minimum $\ell_2$ norm interpolator
gives the optimal test loss. In this work, we give a different parametrization
of the model which leads to a new implicit regularization effect that combines
the benefit of $\ell_1$ and $\ell_2$ interpolators. We show that training our
new model via gradient descent leads to an interpolator with near-optimal test
loss. Our result is based on careful analysis of the training dynamics and
provides another example of implicit regularization effect that goes beyond
norm minimization.
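To make the setting concrete, the sketch below fits synthetic sparse data with plain gradient descent on a reparametrized linear model and compares its excess risk to that of the minimum $\ell_2$-norm interpolator. The Hadamard-product parametrization $\beta = u \odot u - v \odot v$ used here is a standard illustration from the diagonal-linear-network literature and only stands in for the paper's parametrization, which is not reproduced; the dimensions, initialization scale $\alpha$, and step size are likewise illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sparse linear regression y = x^T beta* + xi (sizes are assumptions).
n, d, k, sigma = 100, 400, 5, 0.1
beta_star = np.zeros(d)
beta_star[rng.choice(d, size=k, replace=False)] = 1.0
X = rng.standard_normal((n, d))
y = X @ beta_star + sigma * rng.standard_normal(n)

# Minimum l2-norm interpolator: fits the training data exactly but spreads
# weight over all d coordinates, so ||beta - beta*||^2 stays large.
beta_l2 = np.linalg.pinv(X) @ y

# Reparametrized model beta = u*u - v*v trained by gradient descent on the
# squared loss from a small initialization alpha, which biases the
# trajectory toward sparse solutions.
alpha, lr, T = 1e-3, 5e-3, 10_000
u = np.full(d, alpha)
v = np.full(d, alpha)
for _ in range(T):
    beta = u * u - v * v
    g = X.T @ (X @ beta - y) / n      # gradient of the squared loss w.r.t. beta
    u -= lr * 2 * u * g               # chain rule: d beta / d u =  2u
    v += lr * 2 * v * g               # chain rule: d beta / d v = -2v

beta_gd = u * u - v * v
print("train MSE (reparam. GD)         :", np.mean((X @ beta_gd - y) ** 2))
print("excess risk, min-l2 interpolator:", np.sum((beta_l2 - beta_star) ** 2))
print("excess risk, reparam. GD        :", np.sum((beta_gd - beta_star) ** 2))
```

With a small initialization the trajectory concentrates on the support of $\beta^*$, so its excess risk comes out far below that of the minimum $\ell_2$-norm interpolator; driving the training loss all the way to zero (exact interpolation) would take many more iterations than this finite-horizon demo runs.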
Related papers
- Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
arXiv Detail & Related papers (2024-06-10T22:57:27Z) - Robust Capped lp-Norm Support Vector Ordinal Regression [85.84718111830752]
Ordinal regression is a specialized supervised problem where the labels show an inherent order.
Support Vector Ordinal Regression, as an outstanding ordinal regression model, is widely used in many ordinal regression tasks.
We introduce a new model, Capped $\ell_p$-Norm Support Vector Ordinal Regression (CSVOR), that is robust to outliers.
arXiv Detail & Related papers (2024-04-25T13:56:05Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators; a generic sketch of this interpolation step appears after this list.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Noisy Interpolation Learning with Shallow Univariate ReLU Networks [33.900009202637285]
Mallinar et al. (2022) noted that neural networks often seem to exhibit "tempered overfitting", wherein the population risk does not converge to the Bayes-optimal error.
We provide the first rigorous analysis of the overfitting behavior of regression with minimum-norm ($\ell_2$ of the weights) interpolation.
arXiv Detail & Related papers (2023-07-28T08:41:12Z) - The Implicit Bias of Batch Normalization in Linear Models and Two-layer
Linear Convolutional Neural Networks [117.93273337740442]
We show that gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate.
We also show that batch normalization has an implicit bias towards a patch-wise uniform margin.
arXiv Detail & Related papers (2023-06-20T16:58:00Z) - Theoretical Characterization of the Generalization Performance of
Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z) - Greedy Pruning with Group Lasso Provably Generalizes for Matrix Sensing [30.508036898655114]
Pruning schemes have been widely used in practice to reduce the complexity of trained models with a massive number of parameters.
Running gradient descent in the absence of regularization results in models that are not suitable for greedy pruning, i.e., many columns could have their $\ell_2$ norm comparable to that of the maximum.
Our results provide the first rigorous insights on why greedy pruning + fine-tuning leads to smaller models which also generalize well.
arXiv Detail & Related papers (2023-03-20T21:05:44Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove convergence of the training loss using $m=\Omega(\log^4(n))$ neurons and $T\approx n$ iterations.
Under the same conditions, we bound the test loss by $\tilde{O}(1/)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Minimum $\ell_{1}$-norm interpolators: Precise asymptotics and multiple
descent [19.781475554462553]
This paper pursues a theoretical understanding of an important type of interpolator: the minimum $\ell_1$-norm interpolator; a minimal construction of this interpolator is sketched after this list.
We observe, and provide rigorous theoretical justification for, a curious multi-descent phenomenon.
Our finding is built upon an exact characterization of the risk behavior, which is governed by a system of two non-linear equations with two unknowns.
arXiv Detail & Related papers (2021-10-18T17:51:14Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Online nonparametric regression with Sobolev kernels [99.12817345416846]
We derive regret upper bounds on the classes of Sobolev spaces $W^{\beta}_{p}(\mathcal{X})$, $p\geq 2$, $\beta>\frac{d}{p}$.
The upper bounds are supported by a minimax regret analysis, which reveals that in the cases $\beta>\frac{d}{2}$ or $p=\infty$ these rates are (essentially) optimal.
arXiv Detail & Related papers (2021-02-06T15:05:14Z)