Implicit Regularization in ReLU Networks with the Square Loss
- URL: http://arxiv.org/abs/2012.05156v2
- Date: Tue, 15 Dec 2020 18:49:53 GMT
- Title: Implicit Regularization in ReLU Networks with the Square Loss
- Authors: Gal Vardi and Ohad Shamir
- Abstract summary: We show that it is impossible to characterize the implicit regularization with the square loss by any explicit function of the model parameters.
Our results suggest that a more general framework may be needed to understand implicit regularization for nonlinear predictors.
- Score: 56.70360094597169
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the implicit regularization (or implicit bias) of gradient
descent has recently been a very active research area. However, the implicit
regularization in nonlinear neural networks is still poorly understood,
especially for regression losses such as the square loss. Perhaps surprisingly,
we prove that even for a single ReLU neuron, it is impossible to characterize
the implicit regularization with the square loss by any explicit function of
the model parameters (although on the positive side, we show it can be
characterized approximately). For one-hidden-layer networks, we prove a similar
result, where in general it is impossible to characterize implicit
regularization properties in this manner, except for the "balancedness"
property identified in Du et al. [2018]. Our results suggest that a more
general framework than the one considered so far may be needed to understand
implicit regularization for nonlinear predictors, and provide some clues on
what this framework should be.
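To make the one positive case above concrete, the following minimal numerical sketch (our illustration, not code from the paper) checks the balancedness property of Du et al. [2018]: for a one-hidden-layer ReLU network f(x) = sum_j a_j * relu(<w_j, x>) trained on the square loss, gradient flow exactly conserves a_j^2 - ||w_j||^2 for every hidden neuron, so small-step gradient descent should preserve it up to O(lr) while the parameters themselves move.

```python
# Sketch of the "balancedness" invariant of Du et al. [2018] for a
# one-hidden-layer ReLU network trained with the square loss.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 5, 3                      # samples, input dim, hidden width
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
W = rng.standard_normal((k, d))         # rows are the hidden weights w_j
a = rng.standard_normal(k)              # output weights a_j
W0, a0 = W.copy(), a.copy()

balance = lambda: a**2 - (W**2).sum(axis=1)
before = balance()

lr = 1e-4
for _ in range(2000):
    pre = X @ W.T                       # (n, k) pre-activations
    h = np.maximum(pre, 0.0)            # ReLU features
    r = h @ a - y                       # residuals f(x_i) - y_i
    a_new = a - lr * (h.T @ r)          # dL/da_j = sum_i r_i h_ij
    W -= lr * ((pre > 0) * r[:, None] * a).T @ X  # dL/dw_j
    a = a_new

print("parameter movement:", np.linalg.norm(W - W0) + np.linalg.norm(a - a0))
print("invariant drift:   ", np.abs(balance() - before).max())
```

With these settings the per-neuron invariant should drift by orders of magnitude less than the parameters move, illustrating why balancedness is the one property that survives as an explicit characterization.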
Related papers
- Generalization for Least Squares Regression With Simple Spiked Covariances [3.9134031118910264]
The generalization properties of even two-layer neural networks trained by gradient descent remain poorly understood.
Recent work has made progress by describing the spectrum of the feature matrix at the hidden layer.
Yet, the generalization error for linear models with spiked covariances has not been previously determined.
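For reference, "spiked covariance" typically means an isotropic bulk plus a few dominant directions; a standard form (our rendering, not necessarily the paper's exact model) is:

```latex
\Sigma \;=\; \sigma^2 I_d + \sum_{k=1}^{K} \theta_k\, v_k v_k^{\top},
\qquad \theta_1 \ge \cdots \ge \theta_K > 0,
\quad v_k^{\top} v_\ell = \delta_{k\ell}.
```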
arXiv Detail & Related papers (2024-10-17T19:46:51Z) - The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness in ReLU Networks [64.12052498909105]
We study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks.
In two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are highly vulnerable to adversarial examples.
arXiv Detail & Related papers (2023-03-02T18:14:35Z) - Penalising the biases in norm regularisation enforces sparsity [28.86954341732928]
This work shows the parameters' norm required to represent a function is given by the total variation of its second derivative, weighted by a $\sqrt{1+x^2}$ factor.
Notably, this weighting factor disappears when the norm of bias terms is not regularised.
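Rendered as a formula (our reading of the stated result for one-dimensional inputs, with $f''$ understood as a measure):

```latex
\min_{\theta\,:\,f_\theta = f} \|\theta\|^2
\;\propto\; \int_{\mathbb{R}} \sqrt{1+x^{2}}\;\mathrm{d}\lvert f'' \rvert(x)
```

When the bias norms are left unregularised, the $\sqrt{1+x^2}$ weight drops and only the plain total variation $\int \mathrm{d}\lvert f'' \rvert$ remains.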
arXiv Detail & Related papers (2023-03-02T15:33:18Z) - Implicit Regularization for Group Sparsity [33.487964460794764]
We show that gradient descent over the squared regression loss, without any explicit regularization, is biased towards solutions with a group-sparsity structure.
We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates.
In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression.
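As a hedged illustration of this kind of implicit sparsity bias (a standard diagonal reparametrization from this literature, not necessarily the paper's exact group-sparse model): gradient descent on the square loss over beta = u*u - v*v, started from a small initialization, approximately recovers a sparse solution of an underdetermined linear system.

```python
# Implicit sparsity bias via the Hadamard reparametrization
# beta = u*u - v*v with small initialization (illustration only).
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 100                          # underdetermined: d > n
beta_true = np.zeros(d)
beta_true[[3, 17, 60]] = [3.0, -2.0, 1.5]
A = rng.standard_normal((n, d)) / np.sqrt(n)
y = A @ beta_true                       # noiseless observations

alpha, lr = 1e-3, 0.05                  # small init drives the sparse bias
u = np.full(d, alpha)
v = np.full(d, alpha)
for _ in range(10000):
    g = A.T @ (A @ (u*u - v*v) - y)     # gradient wrt beta
    u, v = u - lr * 2*u*g, v + lr * 2*v*g  # chain rule through u*u, -v*v
beta = u*u - v*v

print("recovered support:", np.flatnonzero(np.abs(beta) > 1e-2))
print("max coefficient error:", np.abs(beta - beta_true).max())
```

The small initialization alpha is what drives the sparse bias; in this literature, larger initializations are known to converge towards dense, minimum-l2-norm-like solutions instead.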
arXiv Detail & Related papers (2023-01-29T20:54:03Z) - On generalization bounds for deep networks based on loss surface implicit regularization [5.68558935178946]
Modern deep neural networks generalize well despite having a large number of parameters, in apparent contradiction with classical statistical learning theory.
arXiv Detail & Related papers (2022-01-12T16:41:34Z) - Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10, on par with SGD.
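Schematically (our toy contrast, not the paper's training recipe), the only difference between the two regimes is whether each step uses the exact gradient or a noisy minibatch estimate:

```python
# Toy contrast between a full-batch GD step and a minibatch SGD step on
# the same squared loss (the paper's CIFAR-10 experiments use a full
# training pipeline, not this toy).
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((256, 10))
y = rng.standard_normal(256)
w = np.zeros(10)
lr = 0.05

def grad(w, Xb, yb):                    # gradient of 0.5 * mean sq. error
    return Xb.T @ (Xb @ w - yb) / len(yb)

w_full = w - lr * grad(w, X, y)         # full-batch: exact, deterministic
batch = rng.choice(256, size=32, replace=False)
w_sgd = w - lr * grad(w, X[batch], y[batch])  # SGD: noisy estimate
print(np.linalg.norm(w_full - w_sgd))   # the stochastic deviation
```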
arXiv Detail & Related papers (2021-09-29T00:50:00Z) - Avoiding unwanted results in locally linear embedding: A new understanding of regularization [1.0152838128195465]
Locally linear embedding inherently admits some unwanted results when no regularization is used.
It is observed that all these bad results can be effectively prevented by using regularization.
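For instance, in scikit-learn's implementation the `reg` parameter regularizes the local Gram matrices in exactly this way (our example, not the paper's code):

```python
# Locally linear embedding with explicit regularization of the local
# Gram matrices via scikit-learn's `reg` parameter (illustration only).
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, reg=1e-3)
Y = lle.fit_transform(X)                # reg -> 0 risks the degenerate
print(Y.shape)                          # embeddings the paper analyzes
```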
arXiv Detail & Related papers (2021-08-28T17:23:47Z) - Interpolation can hurt robust generalization even when there is no noise [76.3492338989419]
We show that avoiding interpolation through ridge regularization can significantly improve robust generalization even in the absence of noise.
We prove this phenomenon for the robust risk of both linear regression and classification and hence provide the first theoretical result on robust overfitting.
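To fix ideas about the robust risk in the linear case (a standard derivation of ours, not a claim about the paper's exact perturbation model): for a linear predictor and an l2 adversary of radius epsilon,

```latex
\max_{\|\delta\|_2 \le \varepsilon}
\bigl(\langle \theta, x+\delta\rangle - y\bigr)^2
\;=\;
\bigl(\,\lvert \langle \theta, x\rangle - y\rvert
  + \varepsilon \|\theta\|_2 \bigr)^2
```

so interpolating with a large-norm theta inflates the robust risk even when every standard residual is zero, which is why shrinking the norm via ridge can help.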
arXiv Detail & Related papers (2021-08-05T23:04:15Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
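A minimal sketch of the comparison (our setup; the paper works in a more general overparameterized regime): one pass of constant-stepsize SGD with iterate averaging against the ordinary least squares solution.

```python
# Constant-stepsize SGD with iterate averaging vs. ordinary least
# squares on linear regression (illustration only).
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 20
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

lr = 0.01                               # constant stepsize throughout
theta = np.zeros(d)
avg = np.zeros(d)
for i in rng.permutation(n):            # one pass over the data
    theta -= lr * (X[i] @ theta - y[i]) * X[i]
    avg += theta
avg /= n                                # averaged iterate

print("OLS error:    ", np.linalg.norm(theta_ols - theta_star))
print("SGD-avg error:", np.linalg.norm(avg - theta_star))
```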
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - Dimension Free Generalization Bounds for Non Linear Metric Learning [61.193693608166114]
We provide uniform generalization bounds for two regimes -- the sparse regime, and a non-sparse regime.
We show that by relying on a different, new property of the solutions, it is still possible to provide dimension-free generalization guarantees.
arXiv Detail & Related papers (2021-02-07T14:47:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.