The Inductive Bias of Flatness Regularization for Deep Matrix
Factorization
- URL: http://arxiv.org/abs/2306.13239v1
- Date: Thu, 22 Jun 2023 23:14:57 GMT
- Title: The Inductive Bias of Flatness Regularization for Deep Matrix
Factorization
- Authors: Khashayar Gatmiry, Zhiyuan Li, Ching-Yao Chuang, Sashank Reddi, Tengyu
Ma, Stefanie Jegelka
- Abstract summary: This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, with the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
- Score: 58.851514333119255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works on over-parameterized neural networks have shown that the
stochasticity in optimizers has the implicit regularization effect of
minimizing the sharpness of the loss function (in particular, the trace of its
Hessian) over the family of zero-loss solutions. More explicit forms of flatness
regularization also empirically improve the generalization performance.
However, it remains unclear why and when flatness regularization leads to
better generalization. This work takes the first step toward understanding the
inductive bias of the minimum trace of the Hessian solutions in an important
setting: learning deep linear networks from linear measurements, also known as
\emph{deep matrix factorization}. We show that for all depths greater than one,
with the standard Restricted Isometry Property (RIP) on the measurements,
minimizing the trace of Hessian is approximately equivalent to minimizing the
Schatten 1-norm of the corresponding end-to-end matrix parameters (i.e., the
product of all layer matrices), which in turn leads to better generalization.
We empirically verify our theoretical findings on synthetic datasets.
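
As a rough illustration of the two quantities the abstract relates, the following sketch (not the authors' code; the sizes, depth, and number of measurements are arbitrary choices made here) builds a small deep linear network on Gaussian linear measurements and computes both the Schatten 1-norm of the end-to-end matrix and a Hutchinson estimate of the trace of the loss Hessian.

```python
# Minimal synthetic sketch of the quantities in the paper (illustrative assumptions only).
import torch

torch.manual_seed(0)
d, depth, n_meas = 5, 3, 60                     # matrix size, network depth, #measurements

# Low-rank ground truth and Gaussian measurements (Gaussian ensembles satisfy RIP
# with high probability when n_meas is large enough).
M_star = torch.randn(d, 2) @ torch.randn(2, d)
A = torch.randn(n_meas, d, d)
y = (A * M_star).sum(dim=(1, 2))                # y_i = <A_i, M*>

# Layer matrices of the deep linear network (the trainable parameters).
Ws = [0.3 * torch.randn(d, d) for _ in range(depth)]
for W in Ws:
    W.requires_grad_(True)

def end_to_end(Ws):
    prod = Ws[0]
    for Wi in Ws[1:]:
        prod = Wi @ prod
    return prod

def loss(Ws):
    preds = (A * end_to_end(Ws)).sum(dim=(1, 2))
    return ((preds - y) ** 2).mean()

def hessian_trace(Ws, n_probes=50):
    """Hutchinson estimate of tr(H) over all layer parameters."""
    est = 0.0
    for _ in range(n_probes):
        vs = [torch.randn_like(W) for W in Ws]
        grads = torch.autograd.grad(loss(Ws), Ws, create_graph=True)
        hvs = torch.autograd.grad(sum((g * v).sum() for g, v in zip(grads, vs)), Ws)
        est += sum((h * v).sum() for h, v in zip(hvs, vs)).item()
    return est / n_probes

W_end = end_to_end(Ws).detach()
print("Schatten 1-norm of end-to-end matrix:", torch.linalg.matrix_norm(W_end, ord="nuc").item())
print("Hutchinson estimate of tr(Hessian):  ", hessian_trace(Ws))
```

The paper's claim concerns minimizers: among (near-)zero-loss solutions, the one with the smallest Hessian trace approximately minimizes the Schatten 1-norm of the end-to-end matrix. The sketch only shows how each quantity can be evaluated on a toy instance.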
Related papers
- Entrywise error bounds for low-rank approximations of kernel matrices [55.524284152242096]
We derive entrywise error bounds for low-rank approximations of kernel matrices obtained using the truncated eigen-decomposition.
A key technical innovation is a delocalisation result for the eigenvectors of the kernel matrix corresponding to small eigenvalues.
We validate our theory with an empirical study of a collection of synthetic and real-world datasets.
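
For the entry above, a small numerical sketch of the object it analyzes: the entrywise error of a rank-r truncated eigendecomposition of a kernel matrix. The RBF kernel, Gaussian inputs, and the rank r are illustrative assumptions made here, not choices from that paper.

```python
# Entrywise error of a rank-r truncated eigendecomposition of a kernel matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)                   # RBF (Gaussian) kernel matrix

eigvals, eigvecs = np.linalg.eigh(K)          # eigenvalues in ascending order
r = 20
V, lam = eigvecs[:, -r:], eigvals[-r:]        # keep the r largest eigenpairs
K_r = (V * lam) @ V.T                         # truncated eigendecomposition

print("max entrywise error max_ij |K - K_r|_ij:", np.abs(K - K_r).max())
print("spectral-norm error ||K - K_r||_2      :", np.linalg.norm(K - K_r, 2))
```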
arXiv Detail & Related papers (2024-05-23T12:26:25Z)
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- FAM: Relative Flatness Aware Minimization [5.132856559837775]
Optimizing for flatness has been proposed as early as 1994 by Hochreiter and Schmidhuber.
Recent theoretical work suggests that a particular relative flatness measure can be connected to generalization.
We derive a regularizer based on this relative flatness that is easy to compute, fast, efficient, and works with arbitrary loss functions.
arXiv Detail & Related papers (2023-07-05T14:48:24Z)
- Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z)
- Demystifying the Global Convergence Puzzle of Learning Over-parameterized ReLU Nets in Very High Dimensions [1.3401746329218014]
This paper is devoted to a rigorous theory for demystifying the global convergence phenomenon in a challenging scenario: learning over-parameterized ReLU nets on very high-dimensional data.
arXiv Detail & Related papers (2022-06-05T02:14:21Z)
- Flat minima generalize for low-rank matrix recovery [16.956238550063365]
We show that flat minima, measured by the trace of the Hessian, exactly recover the ground truth under standard statistical assumptions.
For matrix completion, we establish weak recovery, although empirical evidence suggests exact recovery holds here as well.
arXiv Detail & Related papers (2022-03-07T22:35:20Z)
- Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Empirically, inductive biases are central in preventing overfitting.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares.
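
A toy comparison of the two estimators named in that entry follows; the dimensions, step size, and noise level are assumptions made here for the demo, not the paper's setting.

```python
# Single-pass constant-stepsize SGD vs. the minimum-norm least-squares solution
# in over-parameterized linear regression.
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 500                                # over-parameterized: d > n
X = rng.normal(size=(n, d)) / np.sqrt(d)       # rows have roughly unit norm
w_star = np.zeros(d)
w_star[:5] = 1.0                               # simple ground truth for the demo
y = X @ w_star + 0.1 * rng.normal(size=n)

# Single-pass SGD with a constant step size and iterate averaging.
w, w_avg, step = np.zeros(d), np.zeros(d), 0.5
for i in range(n):
    w -= step * (X[i] @ w - y[i]) * X[i]
    w_avg += w / n

w_minnorm = np.linalg.pinv(X) @ y              # ordinary least squares / min-norm interpolator

X_test = rng.normal(size=(2000, d)) / np.sqrt(d)
for name, w_hat in [("SGD average", w_avg), ("min-norm OLS", w_minnorm)]:
    err = np.mean((X_test @ (w_hat - w_star)) ** 2)
    print(f"{name:12s} test error: {err:.4f}")
```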
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
- Understanding Implicit Regularization in Over-Parameterized Single Index Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z)
- To Each Optimizer a Norm, To Each Norm its Generalization [31.682969645989512]
We study the implicit regularization of optimization methods for linear models interpolating the training data in the under-parametrized and over-parametrized regimes.
We argue that analyzing convergence to the standard maximum l2-margin is arbitrary and show that minimizing the norm induced by the data results in better generalization.
arXiv Detail & Related papers (2020-06-11T21:07:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.