Implicit Regularization Leads to Benign Overfitting for Sparse Linear
Regression
- URL: http://arxiv.org/abs/2302.00257v2
- Date: Fri, 26 May 2023 03:27:21 GMT
- Title: Implicit Regularization Leads to Benign Overfitting for Sparse Linear
Regression
- Authors: Mo Zhou, Rong Ge
- Abstract summary: In deep learning, often the training process finds an interpolator (a solution with 0 training loss) but the test loss is still low.
One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties for the interpolator.
We show that training our new model via gradient descent leads to an interpolator with near-optimal test loss.
- Score: 16.551664358490658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In deep learning, often the training process finds an interpolator (a
solution with 0 training loss), but the test loss is still low. This
phenomenon, known as benign overfitting, is a major mystery that has received a lot
of recent attention. One common mechanism for benign overfitting is implicit
regularization, where the training process leads to additional properties for
the interpolator, often characterized by minimizing certain norms. However,
even for a simple sparse linear regression problem $y = \beta^{*\top} x +\xi$
with sparse $\beta^*$, neither the minimum $\ell_1$ nor the minimum $\ell_2$ norm interpolator
gives the optimal test loss. In this work, we give a different parametrization
of the model which leads to a new implicit regularization effect that combines
the benefit of $\ell_1$ and $\ell_2$ interpolators. We show that training our
new model via gradient descent leads to an interpolator with near-optimal test
loss. Our result is based on careful analysis of the training dynamics and
provides another example of implicit regularization effect that goes beyond
norm minimization.
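To make the setting concrete, the sketch below fits synthetic sparse data with plain gradient descent on a reparametrized linear model and compares its excess risk to that of the minimum $\ell_2$-norm interpolator. The Hadamard-product parametrization $\beta = u \odot u - v \odot v$ used here is a standard illustration from the diagonal-linear-network literature and only stands in for the paper's parametrization, which is not reproduced; the dimensions, initialization scale $\alpha$, and step size are likewise illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sparse linear regression y = x^T beta* + xi (sizes are assumptions).
n, d, k, sigma = 100, 400, 5, 0.1
beta_star = np.zeros(d)
beta_star[rng.choice(d, size=k, replace=False)] = 1.0
X = rng.standard_normal((n, d))
y = X @ beta_star + sigma * rng.standard_normal(n)

# Minimum l2-norm interpolator: fits the training data exactly but spreads
# weight over all d coordinates, so ||beta - beta*||^2 stays large.
beta_l2 = np.linalg.pinv(X) @ y

# Reparametrized model beta = u*u - v*v trained by gradient descent on the
# squared loss from a small initialization alpha, which biases the
# trajectory toward sparse solutions.
alpha, lr, T = 1e-3, 5e-3, 10_000
u = np.full(d, alpha)
v = np.full(d, alpha)
for _ in range(T):
    beta = u * u - v * v
    g = X.T @ (X @ beta - y) / n      # gradient of the squared loss w.r.t. beta
    u -= lr * 2 * u * g               # chain rule: d beta / d u =  2u
    v += lr * 2 * v * g               # chain rule: d beta / d v = -2v

beta_gd = u * u - v * v
print("train MSE (reparam. GD)         :", np.mean((X @ beta_gd - y) ** 2))
print("excess risk, min-l2 interpolator:", np.sum((beta_l2 - beta_star) ** 2))
print("excess risk, reparam. GD        :", np.sum((beta_gd - beta_star) ** 2))
```

With a small initialization the trajectory concentrates on the support of $\beta^*$, so its excess risk comes out far below that of the minimum $\ell_2$-norm interpolator; driving the training loss all the way to zero (exact interpolation) would take many more iterations than this finite-horizon demo runs.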
Related papers
- Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
arXiv Detail & Related papers (2024-06-10T22:57:27Z) - Robust Capped lp-Norm Support Vector Ordinal Regression [85.84718111830752]
Ordinal regression is a specialized supervised problem where the labels show an inherent order.
Support Vector Ordinal Regression, as an outstanding ordinal regression model, is widely used in many ordinal regression tasks.
We introduce a new model, Capped $\ell_p$-Norm Support Vector Ordinal Regression (CSVOR), that is robust to outliers.
arXiv Detail & Related papers (2024-04-25T13:56:05Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators; a generic sketch of this interpolation step appears after this list.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Noisy Interpolation Learning with Shallow Univariate ReLU Networks [33.900009202637285]
Mallinar et al. (2022) noted that neural networks often seem to exhibit "tempered overfitting", wherein the population risk does not converge to the Bayes-optimal error.
We provide the first rigorous analysis of the overfitting behavior of regression with minimum-norm ($\ell_2$ of the weights) interpolation.
arXiv Detail & Related papers (2023-07-28T08:41:12Z) - The Implicit Bias of Batch Normalization in Linear Models and Two-layer
Linear Convolutional Neural Networks [117.93273337740442]
We show that gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate.
We also show that batch normalization has an implicit bias towards a patch-wise uniform margin.
arXiv Detail & Related papers (2023-06-20T16:58:00Z) - Theoretical Characterization of the Generalization Performance of
Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z) - Greedy Pruning with Group Lasso Provably Generalizes for Matrix Sensing [30.508036898655114]
Pruning schemes have been widely used in practice to reduce the complexity of trained models with a massive number of parameters.
Running gradient descent in the absence of regularization results in models that are not suitable for greedy pruning, i.e., many columns could have their $\ell_2$ norm comparable to that of the maximum.
Our results provide the first rigorous insights on why greedy pruning + fine-tuning leads to smaller models which also generalize well.
arXiv Detail & Related papers (2023-03-20T21:05:44Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove convergence of the training loss using $m=\Omega(\log^4(n))$ neurons and $T\approx n$ iterations.
Under the same conditions, we bound the test loss by $\tilde{O}(1/)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Minimum $\ell_{1}$-norm interpolators: Precise asymptotics and multiple
descent [19.781475554462553]
This paper pursues a theoretical understanding of an important type of interpolator: the minimum $\ell_1$-norm interpolator; a minimal construction of this interpolator is sketched after this list.
We observe, and provide rigorous theoretical justification for, a curious multi-descent phenomenon.
Our finding is built upon an exact characterization of the risk behavior, which is governed by a system of two non-linear equations with two unknowns.
arXiv Detail & Related papers (2021-10-18T17:51:14Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Online nonparametric regression with Sobolev kernels [99.12817345416846]
We derive regret upper bounds on the classes of Sobolev spaces $W^{\beta}_{p}(\mathcal{X})$, $p\geq 2$, $\beta>\frac{d}{p}$.
The upper bounds are supported by a minimax regret analysis, which reveals that in the cases $\beta>\frac{d}{2}$ or $p=\infty$ these rates are (essentially) optimal.
arXiv Detail & Related papers (2021-02-06T15:05:14Z)