To Each Optimizer a Norm, To Each Norm its Generalization
- URL: http://arxiv.org/abs/2006.06821v1
- Date: Thu, 11 Jun 2020 21:07:38 GMT
- Title: To Each Optimizer a Norm, To Each Norm its Generalization
- Authors: Sharan Vaswani, Reza Babanezhad, Jose Gallego-Posada, Aaron Mishkin,
Simon Lacoste-Julien, Nicolas Le Roux
- Abstract summary: We study the implicit regularization of optimization methods for linear models interpolating the training data in the under-parametrized and over-parametrized regimes.
We argue that analyzing convergence to the standard maximum l2-margin is arbitrary and show that minimizing the norm induced by the data results in better generalization.
- Score: 31.682969645989512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the implicit regularization of optimization methods for linear
models interpolating the training data in the under-parametrized and
over-parametrized regimes. Since it is difficult to determine whether an
optimizer converges to solutions that minimize a known norm, we flip the
problem and investigate what is the corresponding norm minimized by an
interpolating solution. Using this reasoning, we prove that for
over-parameterized linear regression, projections onto linear spans can be used
to move between different interpolating solutions. For under-parameterized
linear classification, we prove that for any linear classifier separating the
data, there exists a family of quadratic norms ||.||_P such that the
classifier's direction is the same as that of the maximum P-margin solution.
For linear classification, we argue that analyzing convergence to the standard
maximum l2-margin is arbitrary and show that minimizing the norm induced by the
data results in better generalization. Furthermore, for over-parameterized
linear classification, projections onto the data-span enable us to use
techniques from the under-parameterized setting. On the empirical side, we
propose techniques to bias optimizers towards better generalizing solutions,
improving their test performance. We validate our theoretical results via
synthetic experiments, and use the neural tangent kernel to handle non-linear
models.
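As a concrete illustration of the projection argument above, the following NumPy sketch (an illustration only, not code from the paper; the dimensions and random data are assumptions) builds an arbitrary interpolating solution for an over-parameterized linear regression problem and projects it onto the span of the data, recovering the minimum l2-norm interpolator.

```python
import numpy as np

# Sketch: in the over-parameterized regime (n < d), any interpolating
# solution w (with X @ w = y) projected onto the row span of X is itself
# an interpolator and coincides with the minimum l2-norm solution.
rng = np.random.default_rng(0)
n, d = 20, 100                                   # n < d: over-parameterized
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# An arbitrary interpolating solution: the min-norm solution plus any
# component from the null space of X.
w_min = X.T @ np.linalg.solve(X @ X.T, y)        # minimum l2-norm interpolator
null_component = rng.standard_normal(d)
null_component -= X.T @ np.linalg.solve(X @ X.T, X @ null_component)
w_any = w_min + null_component                   # still satisfies X @ w_any = y

# Projecting onto the data span (row space of X) recovers w_min.
P_span = X.T @ np.linalg.solve(X @ X.T, X)       # projector onto the row space
w_proj = P_span @ w_any

assert np.allclose(X @ w_any, y)                 # w_any interpolates the data
assert np.allclose(w_proj, w_min)                # projection gives the min-norm solution
```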
Related papers
- Synergistic eigenanalysis of covariance and Hessian matrices for enhanced binary classification [72.77513633290056]
We present a novel approach that combines the eigenanalysis of a covariance matrix evaluated on a training set with a Hessian matrix evaluated on a deep learning model.
Our method captures intricate patterns and relationships, enhancing classification performance.
arXiv Detail & Related papers (2024-02-14T16:10:42Z)
- Regularized Linear Discriminant Analysis Using a Nonlinear Covariance Matrix Estimator [11.887333567383239]
Linear discriminant analysis (LDA) is a widely used technique for data classification.
LDA becomes inefficient when the data covariance matrix is ill-conditioned.
Regularized LDA methods have been proposed to cope with such a situation (a generic shrinkage-based sketch appears after this list).
arXiv Detail & Related papers (2024-01-31T11:37:14Z)
- Gradient-based bilevel optimization for multi-penalty Ridge regression through matrix differential calculus [0.46040036610482665]
We introduce a gradient-based approach to the problem of linear regression with l2-regularization.
We show that our approach outperforms LASSO, Ridge, and Elastic Net regression.
The analytical expression of the gradient proves to be more efficient in terms of computational time than automatic differentiation (a single-penalty sketch of this hypergradient appears after this list).
arXiv Detail & Related papers (2023-11-23T20:03:51Z)
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of solutions with minimum trace of the Hessian in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z)
- Linear Convergence of Reshuffling Kaczmarz Methods With Sparse Constraints [7.936519714074615]
The Kaczmarz method (KZ) and its variants have been extensively studied due to their simplicity and efficiency in solving linear equation systems.
We provide the first theoretical convergence guarantees for KHT by showing that it converges linearly to the solution of a system with sparsity constraints (a sketch of the basic Kaczmarz update appears after this list).
arXiv Detail & Related papers (2023-04-20T07:14:24Z)
- LQF: Linear Quadratic Fine-Tuning [114.3840147070712]
We present the first method for linearizing a pre-trained model that achieves comparable performance to non-linear fine-tuning.
LQF consists of simple modifications to the architecture, loss function and optimization typically used for classification.
arXiv Detail & Related papers (2020-12-21T06:40:20Z)
- Adaptive and Oblivious Randomized Subspace Methods for High-Dimensional Optimization: Sharp Analysis and Lower Bounds [37.03247707259297]
A suitable adaptive subspace can be generated by sampling a correlated random matrix whose second order statistics mirror the input data.
We show that the relative error of the randomized approximations can be tightly characterized in terms of the spectrum of the data matrix.
Experimental results show that the proposed approach enables significant speed ups in a wide variety of machine learning and optimization problems.
arXiv Detail & Related papers (2020-12-13T13:02:31Z)
- Understanding Implicit Regularization in Over-Parameterized Single Index Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z)
- Random extrapolation for primal-dual coordinate descent [61.55967255151027]
We introduce a randomly extrapolated primal-dual coordinate descent method that adapts to sparsity of the data matrix and the favorable structures of the objective function.
We show almost sure convergence of the sequence and optimal sublinear convergence rates for the primal-dual gap and objective values, in the general convex-concave case.
arXiv Detail & Related papers (2020-07-13T17:39:35Z)
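For the regularized LDA entry above, here is a minimal sketch of the standard shrinkage-type regularization that such methods build on. It is a generic illustration, not that paper's nonlinear covariance matrix estimator; the toy data, the shrinkage weight alpha, and the equal-prior threshold are assumptions made for the example.

```python
import numpy as np

# Generic shrinkage-regularized LDA sketch (illustration only).
rng = np.random.default_rng(0)
d, n_per_class = 50, 30                        # d close to n: covariance is ill-conditioned
X0 = rng.standard_normal((n_per_class, d))
X1 = rng.standard_normal((n_per_class, d)) + 0.5

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S = np.cov(np.vstack([X0 - mu0, X1 - mu1]).T)  # within-class sample covariance (d x d)

alpha = 0.1                                    # shrinkage weight (hyperparameter)
S_reg = (1 - alpha) * S + alpha * (np.trace(S) / d) * np.eye(d)

w = np.linalg.solve(S_reg, mu1 - mu0)          # regularized LDA direction
b = -0.5 * w @ (mu0 + mu1)                     # threshold, assuming equal class priors

def predict(X):
    return (X @ w + b > 0).astype(int)         # 1 -> class 1, 0 -> class 0

print("training accuracy:",
      (predict(np.vstack([X0, X1])) == np.repeat([0, 1], n_per_class)).mean())
```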
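For the gradient-based bilevel Ridge regression entry, the sketch below computes the analytical gradient of a validation loss with respect to a single regularization strength and checks it against a finite-difference estimate. This is a simplified single-penalty illustration of the idea, not that paper's multi-penalty method, and the data and lambda value are arbitrary.

```python
import numpy as np

# Single-penalty sketch of the analytical hypergradient for Ridge regression.
# Uses d w(lam)/d lam = -(X'X + lam I)^{-1} w(lam).
rng = np.random.default_rng(1)
n_tr, n_val, d = 80, 40, 20
X_tr, X_val = rng.standard_normal((n_tr, d)), rng.standard_normal((n_val, d))
w_true = rng.standard_normal(d)
y_tr = X_tr @ w_true + 0.1 * rng.standard_normal(n_tr)
y_val = X_val @ w_true + 0.1 * rng.standard_normal(n_val)

def ridge(lam):
    A = X_tr.T @ X_tr + lam * np.eye(d)
    return np.linalg.solve(A, X_tr.T @ y_tr), A

def val_loss(lam):
    w, _ = ridge(lam)
    return np.sum((X_val @ w - y_val) ** 2)

lam = 0.5
w, A = ridge(lam)
dw_dlam = -np.linalg.solve(A, w)                   # d w(lam) / d lam
grad = 2 * (X_val @ w - y_val) @ (X_val @ dw_dlam) # chain rule on the validation loss

eps = 1e-5                                         # finite-difference check
fd = (val_loss(lam + eps) - val_loss(lam - eps)) / (2 * eps)
print(grad, fd)                                    # the two values should agree closely
```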
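For the reshuffling Kaczmarz entry, this sketch shows the basic reshuffled Kaczmarz update on a consistent linear system. The sparsity-constrained KHT variant analyzed in that paper adds a sparsity-enforcing step at each iteration, which is omitted here; the problem sizes and epoch count are arbitrary.

```python
import numpy as np

# Sketch of the basic reshuffled Kaczmarz iteration for a consistent
# linear system A x = b (illustration only; no sparsity constraint).
rng = np.random.default_rng(2)
m, n = 200, 50
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true                                      # consistent system

x = np.zeros(n)
for epoch in range(30):
    for i in rng.permutation(m):                    # reshuffle the rows each epoch
        a_i = A[i]
        x += (b[i] - a_i @ x) / (a_i @ a_i) * a_i   # project onto the i-th hyperplane

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```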
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content presented here (including all information) and is not responsible for any consequences arising from its use.