To Each Optimizer a Norm, To Each Norm its Generalization
- URL: http://arxiv.org/abs/2006.06821v1
- Date: Thu, 11 Jun 2020 21:07:38 GMT
- Title: To Each Optimizer a Norm, To Each Norm its Generalization
- Authors: Sharan Vaswani, Reza Babanezhad, Jose Gallego-Posada, Aaron Mishkin,
Simon Lacoste-Julien, Nicolas Le Roux
- Abstract summary: We study the implicit regularization of optimization methods for linear models interpolating the training data in the under-parametrized and over-parametrized regimes.
We argue that analyzing convergence to the standard maximum l2-margin is arbitrary and show that minimizing the norm induced by the data results in better generalization.
- Score: 31.682969645989512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the implicit regularization of optimization methods for linear
models interpolating the training data in the under-parametrized and
over-parametrized regimes. Since it is difficult to determine whether an
optimizer converges to solutions that minimize a known norm, we flip the
problem and investigate what is the corresponding norm minimized by an
interpolating solution. Using this reasoning, we prove that for
over-parameterized linear regression, projections onto linear spans can be used
to move between different interpolating solutions. For under-parameterized
linear classification, we prove that for any linear classifier separating the
data, there exists a family of quadratic norms ||.||_P such that the
classifier's direction is the same as that of the maximum P-margin solution.
For linear classification, we argue that analyzing convergence to the standard
maximum l2-margin is arbitrary and show that minimizing the norm induced by the
data results in better generalization. Furthermore, for over-parameterized
linear classification, projections onto the data-span enable us to use
techniques from the under-parameterized setting. On the empirical side, we
propose techniques to bias optimizers towards better generalizing solutions,
improving their test performance. We validate our theoretical results via
synthetic experiments, and use the neural tangent kernel to handle non-linear
models.
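As a concrete illustration of the projection argument above, the following NumPy sketch (an illustration only, not code from the paper; the dimensions and random data are assumptions) builds an arbitrary interpolating solution for an over-parameterized linear regression problem and projects it onto the span of the data, recovering the minimum l2-norm interpolator.

```python
import numpy as np

# Sketch: in the over-parameterized regime (n < d), any interpolating
# solution w (with X @ w = y) projected onto the row span of X is itself
# an interpolator and coincides with the minimum l2-norm solution.
rng = np.random.default_rng(0)
n, d = 20, 100                                   # n < d: over-parameterized
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# An arbitrary interpolating solution: the min-norm solution plus any
# component from the null space of X.
w_min = X.T @ np.linalg.solve(X @ X.T, y)        # minimum l2-norm interpolator
null_component = rng.standard_normal(d)
null_component -= X.T @ np.linalg.solve(X @ X.T, X @ null_component)
w_any = w_min + null_component                   # still satisfies X @ w_any = y

# Projecting onto the data span (row space of X) recovers w_min.
P_span = X.T @ np.linalg.solve(X @ X.T, X)       # projector onto the row space
w_proj = P_span @ w_any

assert np.allclose(X @ w_any, y)                 # w_any interpolates the data
assert np.allclose(w_proj, w_min)                # projection gives the min-norm solution
```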
Related papers
- Synergistic eigenanalysis of covariance and Hessian matrices for enhanced binary classification [72.77513633290056]
We present a novel approach that combines the eigenanalysis of a covariance matrix evaluated on a training set with a Hessian matrix evaluated on a deep learning model.
Our method captures intricate patterns and relationships, enhancing classification performance.
arXiv Detail & Related papers (2024-02-14T16:10:42Z)
- Regularized Linear Discriminant Analysis Using a Nonlinear Covariance Matrix Estimator [11.887333567383239]
Linear discriminant analysis (LDA) is a widely used technique for data classification.
LDA becomes inefficient when the data covariance matrix is ill-conditioned.
Regularized LDA methods have been proposed to cope with such a situation (a generic shrinkage-based sketch appears after this list).
arXiv Detail & Related papers (2024-01-31T11:37:14Z)
- Gradient-based bilevel optimization for multi-penalty Ridge regression through matrix differential calculus [0.46040036610482665]
We introduce a gradient-based approach to the problem of linear regression with l2-regularization.
We show that our approach outperforms LASSO, Ridge, and Elastic Net regression.
The analytical expression of the gradient proves to be more efficient in terms of computational time than automatic differentiation (a single-penalty sketch of this hypergradient appears after this list).
arXiv Detail & Related papers (2023-11-23T20:03:51Z)
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of solutions with minimum trace of the Hessian in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z)
- Linear Convergence of Reshuffling Kaczmarz Methods With Sparse Constraints [7.936519714074615]
The Kaczmarz method (KZ) and its variants have been extensively studied due to their simplicity and efficiency in solving linear equation systems.
We provide the first theoretical convergence guarantees for KHT by showing that it converges linearly to the solution of a system with sparsity constraints (a sketch of the basic Kaczmarz update appears after this list).
arXiv Detail & Related papers (2023-04-20T07:14:24Z)
- LQF: Linear Quadratic Fine-Tuning [114.3840147070712]
We present the first method for linearizing a pre-trained model that achieves comparable performance to non-linear fine-tuning.
LQF consists of simple modifications to the architecture, loss function and optimization typically used for classification.
arXiv Detail & Related papers (2020-12-21T06:40:20Z)
- Adaptive and Oblivious Randomized Subspace Methods for High-Dimensional Optimization: Sharp Analysis and Lower Bounds [37.03247707259297]
A suitable adaptive subspace can be generated by sampling a correlated random matrix whose second order statistics mirror the input data.
We show that the relative error of the randomized approximations can be tightly characterized in terms of the spectrum of the data matrix.
Experimental results show that the proposed approach enables significant speed ups in a wide variety of machine learning and optimization problems.
arXiv Detail & Related papers (2020-12-13T13:02:31Z)
- Understanding Implicit Regularization in Over-Parameterized Single Index Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z)
- Random extrapolation for primal-dual coordinate descent [61.55967255151027]
We introduce a randomly extrapolated primal-dual coordinate descent method that adapts to sparsity of the data matrix and the favorable structures of the objective function.
We show almost sure convergence of the sequence and optimal sublinear convergence rates for the primal-dual gap and objective values, in the general convex-concave case.
arXiv Detail & Related papers (2020-07-13T17:39:35Z)
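For the regularized LDA entry above, here is a minimal sketch of the standard shrinkage-type regularization that such methods build on. It is a generic illustration, not that paper's nonlinear covariance matrix estimator; the toy data, the shrinkage weight alpha, and the equal-prior threshold are assumptions made for the example.

```python
import numpy as np

# Generic shrinkage-regularized LDA sketch (illustration only).
rng = np.random.default_rng(0)
d, n_per_class = 50, 30                        # d close to n: covariance is ill-conditioned
X0 = rng.standard_normal((n_per_class, d))
X1 = rng.standard_normal((n_per_class, d)) + 0.5

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S = np.cov(np.vstack([X0 - mu0, X1 - mu1]).T)  # within-class sample covariance (d x d)

alpha = 0.1                                    # shrinkage weight (hyperparameter)
S_reg = (1 - alpha) * S + alpha * (np.trace(S) / d) * np.eye(d)

w = np.linalg.solve(S_reg, mu1 - mu0)          # regularized LDA direction
b = -0.5 * w @ (mu0 + mu1)                     # threshold, assuming equal class priors

def predict(X):
    return (X @ w + b > 0).astype(int)         # 1 -> class 1, 0 -> class 0

print("training accuracy:",
      (predict(np.vstack([X0, X1])) == np.repeat([0, 1], n_per_class)).mean())
```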
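For the gradient-based bilevel Ridge regression entry, the sketch below computes the analytical gradient of a validation loss with respect to a single regularization strength and checks it against a finite-difference estimate. This is a simplified single-penalty illustration of the idea, not that paper's multi-penalty method, and the data and lambda value are arbitrary.

```python
import numpy as np

# Single-penalty sketch of the analytical hypergradient for Ridge regression.
# Uses d w(lam)/d lam = -(X'X + lam I)^{-1} w(lam).
rng = np.random.default_rng(1)
n_tr, n_val, d = 80, 40, 20
X_tr, X_val = rng.standard_normal((n_tr, d)), rng.standard_normal((n_val, d))
w_true = rng.standard_normal(d)
y_tr = X_tr @ w_true + 0.1 * rng.standard_normal(n_tr)
y_val = X_val @ w_true + 0.1 * rng.standard_normal(n_val)

def ridge(lam):
    A = X_tr.T @ X_tr + lam * np.eye(d)
    return np.linalg.solve(A, X_tr.T @ y_tr), A

def val_loss(lam):
    w, _ = ridge(lam)
    return np.sum((X_val @ w - y_val) ** 2)

lam = 0.5
w, A = ridge(lam)
dw_dlam = -np.linalg.solve(A, w)                   # d w(lam) / d lam
grad = 2 * (X_val @ w - y_val) @ (X_val @ dw_dlam) # chain rule on the validation loss

eps = 1e-5                                         # finite-difference check
fd = (val_loss(lam + eps) - val_loss(lam - eps)) / (2 * eps)
print(grad, fd)                                    # the two values should agree closely
```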
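For the reshuffling Kaczmarz entry, this sketch shows the basic reshuffled Kaczmarz update on a consistent linear system. The sparsity-constrained KHT variant analyzed in that paper adds a sparsity-enforcing step at each iteration, which is omitted here; the problem sizes and epoch count are arbitrary.

```python
import numpy as np

# Sketch of the basic reshuffled Kaczmarz iteration for a consistent
# linear system A x = b (illustration only; no sparsity constraint).
rng = np.random.default_rng(2)
m, n = 200, 50
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true                                      # consistent system

x = np.zeros(n)
for epoch in range(30):
    for i in rng.permutation(m):                    # reshuffle the rows each epoch
        a_i = A[i]
        x += (b[i] - a_i @ x) / (a_i @ a_i) * a_i   # project onto the i-th hyperplane

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```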
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content presented here (including all information) and is not responsible for any consequences arising from its use.