Penalising the biases in norm regularisation enforces sparsity
- URL: http://arxiv.org/abs/2303.01353v3
- Date: Thu, 9 Nov 2023 09:32:45 GMT
- Title: Penalising the biases in norm regularisation enforces sparsity
- Authors: Etienne Boursier and Nicolas Flammarion
- Abstract summary: This work shows the parameters' norm required to represent a function is given by the total variation of its second derivative, weighted by a $\sqrt{1+x^2}$ factor.
Notably, this weighting factor disappears when the norm of bias terms is not regularised.
- Score: 28.86954341732928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Controlling the parameters' norm often yields good generalisation when
training neural networks. Beyond simple intuitions, the relation between
regularising the parameters' norm and the obtained estimators remains poorly
understood theoretically. For networks with one hidden ReLU layer and
unidimensional data,
this work shows the parameters' norm required to represent a function is given
by the total variation of its second derivative, weighted by a $\sqrt{1+x^2}$
factor. Notably, this weighting factor disappears when the norm of bias terms
is not regularised. The presence of this additional weighting factor is of
utmost significance as it is shown to enforce the uniqueness and sparsity (in
the number of kinks) of the minimal norm interpolator. Conversely, omitting the
biases' norm allows for non-sparse solutions. Penalising the bias terms in the
regularisation, either explicitly or implicitly, thus leads to sparse
estimators.
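As an illustrative sketch of where the weighting factor comes from (not a derivation reproduced from the paper, and assuming the standard parameterisation of a hidden unit as $a\,(wx+b)_+$ with a squared Euclidean penalty on $(a, w, b)$): a single unit places a kink at $x_0 = -b/w$ with a slope change of magnitude $|aw|$. Realising a prescribed slope change $\delta$ at $x_0$ with minimal penalty gives
$$
\min_{|aw| = |\delta|,\; b = -w x_0} \tfrac{1}{2}\bigl(a^2 + w^2 + b^2\bigr)
\;=\; \min_{|aw| = |\delta|} \tfrac{1}{2}\bigl(a^2 + w^2(1 + x_0^2)\bigr)
\;=\; |\delta|\,\sqrt{1 + x_0^2},
$$
by the AM-GM inequality, with equality when $a^2 = w^2(1 + x_0^2)$. Summing over kinks yields a total variation of $f''$ weighted by $\sqrt{1+x^2}$, whereas dropping the $b^2$ term from the penalty gives a per-kink cost of $|\delta|$ regardless of location, consistent with the weighting factor disappearing when biases are not regularised. The paper's exact representational cost also involves boundary terms not reproduced in this sketch.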
Related papers
- Minimum norm interpolation by perceptra: Explicit regularization and
implicit bias [0.3499042782396683]
We investigate how shallow ReLU networks interpolate between known regions.
We numerically study the implicit bias of common optimization algorithms towards known minimum norm interpolants.
arXiv Detail & Related papers (2023-11-10T15:55:47Z) - The Inductive Bias of Flatness Regularization for Deep Matrix
Factorization [58.851514333119255]
This work takes a first step toward understanding the inductive bias of minimum-Hessian-trace solutions in deep linear networks.
We show that for any depth greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - On the Importance of Gradient Norm in PAC-Bayesian Bounds [92.82627080794491]
We propose a new generalization bound that exploits the contractivity of the log-Sobolev inequalities.
We empirically analyze the effect of this new loss-gradient norm term on different neural architectures.
arXiv Detail & Related papers (2022-10-12T12:49:20Z) - The Sample Complexity of One-Hidden-Layer Neural Networks [57.6421258363243]
We study a class of scalar-valued one-hidden-layer networks, and inputs bounded in Euclidean norm.
We prove that controlling the spectral norm of the hidden layer weight matrix is insufficient to get uniform convergence guarantees.
We analyze two important settings where spectral norm control alone turns out to be sufficient.
arXiv Detail & Related papers (2022-02-13T07:12:02Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We highlight a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - Explicit regularization and implicit bias in deep network classifiers
trained with the square loss [2.8935588665357077]
Deep ReLU networks trained with the square loss have been observed to perform well in classification tasks.
We show that convergence to a solution with the absolute minimum norm is expected when normalization techniques are used together with weight decay.
arXiv Detail & Related papers (2020-12-31T21:07:56Z) - Implicit Regularization in ReLU Networks with the Square Loss [56.70360094597169]
We show that it is impossible to characterize the implicit regularization with the square loss by any explicit function of the model parameters.
Our results suggest that a more general framework may be needed to understand implicit regularization for nonlinear predictors.
arXiv Detail & Related papers (2020-12-09T16:48:03Z) - Failures of model-dependent generalization bounds for least-norm
interpolation [39.97534972432276]
We consider bounds on the generalization performance of the least-norm linear regressor.
For a variety of natural joint distributions on training examples, any valid generalization bound must sometimes be very loose.
arXiv Detail & Related papers (2020-10-16T16:30:05Z)