On Generalization of Adaptive Methods for Over-parameterized Linear
Regression
- URL: http://arxiv.org/abs/2011.14066v1
- Date: Sat, 28 Nov 2020 04:19:32 GMT
- Title: On Generalization of Adaptive Methods for Over-parameterized Linear
Regression
- Authors: Vatsal Shah, Soumya Basu, Anastasios Kyrillidis, Sujay Sanghavi
- Abstract summary: We aim to characterize the performance of adaptive methods in the over-parameterized linear regression setting.
Our experiments on over-parameterized linear regression and deep neural networks support this theory.
- Score: 27.156348760303864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over-parameterization and adaptive methods have played a crucial role in the
success of deep learning in the last decade. The widespread use of
over-parameterization has forced us to rethink generalization by bringing forth
new phenomena, such as implicit regularization of optimization algorithms and
double descent with training progression. A series of recent works have started
to shed light on these areas in the quest to understand -- why do neural
networks generalize well? The setting of over-parameterized linear regression
has provided key insights into understanding this mysterious behavior of neural
networks.
In this paper, we aim to characterize the performance of adaptive methods in
the over-parameterized linear regression setting. First, we focus on two
sub-classes of adaptive methods depending on their generalization performance.
For the first class of adaptive methods, the parameter vector remains in the
span of the data and converges to the minimum norm solution like gradient
descent (GD). On the other hand, for the second class of adaptive methods, the
gradient rotation caused by the pre-conditioner matrix results in an in-span
component of the parameter vector that converges to the minimum norm solution
and the out-of-span component that saturates. Our experiments on
over-parameterized linear regression and deep neural networks support this
theory.
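The following is a minimal numpy sketch of the dichotomy described in the abstract. A fixed non-scalar diagonal preconditioner stands in for an adaptive method (the paper's preconditioners evolve with the iterates, so this illustrates the mechanism rather than the paper's exact setting): gradient descent keeps its iterates in the row span of the data and reaches the minimum l2-norm interpolator, while the preconditioned run picks up an out-of-span component.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                          # over-parameterized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum l2-norm interpolator: w = X^T (X X^T)^{-1} y.
w_mn = X.T @ np.linalg.solve(X @ X.T, y)

def run(P, steps=20000, lr=1e-3):
    """Preconditioned gradient descent w <- w - lr * P * grad, from w = 0."""
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * P @ (X.T @ (X @ w - y))
    return w

w_gd = run(np.eye(d))                           # plain GD
w_pre = run(np.diag(rng.uniform(0.5, 2.0, d)))  # fixed diagonal preconditioner

proj = X.T @ np.linalg.solve(X @ X.T, X)        # projector onto the row span of X
for name, w in [("GD", w_gd), ("preconditioned", w_pre)]:
    print(name,
          "| distance to min-norm solution:", np.linalg.norm(w - w_mn),
          "| out-of-span norm:", np.linalg.norm(w - proj @ w))
```

Both runs interpolate the training data; they differ only in which interpolator they select, which is where the generalization question enters.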
Related papers
- Gradient-based bilevel optimization for multi-penalty Ridge regression
through matrix differential calculus [0.46040036610482665]
We introduce a gradient-based approach to the problem of linear regression with l2-regularization.
We show that our approach outperforms LASSO, Ridge, and Elastic Net regression.
The analytical computation of the gradient proves to be more efficient in terms of computational time compared to automatic differentiation.
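A sketch of the idea for a single l2 penalty (the paper treats multiple penalties and derives the full matrix-calculus expressions; names and constants below are illustrative): since the ridge solution has a closed form, the gradient of the validation loss with respect to lambda follows analytically from dw/dlambda = -(X^T X + lambda I)^{-1} w(lambda), with no automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 30
X, Xv = rng.standard_normal((n, d)), rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)
yv = Xv @ w_true + 0.1 * rng.standard_normal(n)

def ridge(lam):
    """Closed-form ridge solution w(lam) = (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def val_loss(lam):
    r = Xv @ ridge(lam) - yv
    return 0.5 * r @ r

def val_grad(lam):
    """Analytic dL_val/dlam, using dw/dlam = -(X^T X + lam I)^{-1} w(lam)."""
    A = X.T @ X + lam * np.eye(d)
    w = np.linalg.solve(A, X.T @ y)
    dw = -np.linalg.solve(A, w)
    return (Xv @ w - yv) @ (Xv @ dw)

eps = 1e-5                         # sanity check against finite differences
fd = (val_loss(1.0 + eps) - val_loss(1.0 - eps)) / (2 * eps)
print("analytic grad:", val_grad(1.0), "finite difference:", fd)

lam = 1.0                          # gradient descent on the hyperparameter
for _ in range(500):
    lam = max(lam - 1e-3 * val_grad(lam), 1e-8)
print("tuned lambda:", lam)
```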
arXiv Detail & Related papers (2023-11-23T20:03:51Z)
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
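A toy illustration of the mechanism, not the paper's algorithm: on the bilinear game f(x, y) = xy, simultaneous gradient descent-ascent is an expansive map and spirals away from the equilibrium, while linearly interpolating toward the result of a few inner steps (Krasnosel'skii-Mann-style averaging, which is where nonexpansive-operator theory enters) contracts. The step size, inner step count, and interpolation weight are arbitrary demo choices.

```python
import numpy as np

def gda(w, eta=0.5):
    """One simultaneous gradient descent-ascent step on f(x, y) = x * y."""
    x, y = w
    return np.array([x - eta * y, y + eta * x])

def inner(w, k=4):
    for _ in range(k):
        w = gda(w)
    return w

w_plain = np.array([1.0, 1.0])
w_avg = np.array([1.0, 1.0])
for _ in range(100):
    w_plain = gda(w_plain)                        # expansive: spirals outward
    w_avg = w_avg + 0.3 * (inner(w_avg) - w_avg)  # linear interpolation step

print("plain GDA distance to equilibrium:", np.linalg.norm(w_plain))
print("interpolated distance to equilibrium:", np.linalg.norm(w_avg))
```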
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- Hebbian learning inspired estimation of the linear regression parameters from queries [18.374824005225186]
We study a variation of a Hebbian learning rule to recover the regression vector in the linear regression model.
We prove that this Hebbian learning rule can achieve considerably faster rates than any non-adaptive method that selects the queries independently of the data.
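For flavor only, here is a generic delta-rule estimator of the kind often described as Hebbian-inspired, run with random (hence non-adaptive) queries; the paper's actual update and its adaptive query-selection scheme, which is where the faster rates come from, differ.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 50
w_star = rng.standard_normal(d)        # unknown regression vector

w = np.zeros(d)
eta = 0.05
for t in range(20000):
    x = rng.standard_normal(d) / np.sqrt(d)        # query (non-adaptive here)
    y = w_star @ x + 0.01 * rng.standard_normal()  # noisy linear response
    w += eta * (y - w @ x) * x                     # correlational update

print("relative recovery error:",
      np.linalg.norm(w - w_star) / np.linalg.norm(w_star))
```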
arXiv Detail & Related papers (2023-09-26T19:00:32Z)
- Implicit Regularization for Group Sparsity [33.487964460794764]
We show that gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure.
We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates.
In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression.
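One way to reproduce the phenomenon in a few lines (the per-group parameterization below is an illustrative choice, not necessarily the paper's): write each group as a shared scalar scale times a direction, w_g = c_g * v_g, and run plain gradient descent on the squared loss from a small initialization. With no explicit penalty, the active group grows while the others stay near zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n, G, k = 30, 10, 5                 # 10 groups of 5 coordinates, d = 50 > n
d = G * k
w_true = np.zeros(d)
w_true[:k] = 1.0                    # only the first group is active
X = rng.standard_normal((n, d))
y = X @ w_true                      # noiseless, under-determined system

# Reparameterize w_g = c_g * v_g and run plain GD from a small initialization.
c = np.full(G, 1e-3)
V = np.full((G, k), 1e-3)
lr = 1e-2
for _ in range(50000):
    w = (c[:, None] * V).ravel()
    g = (X.T @ (X @ w - y) / n).reshape(G, k)  # loss gradient wrt w, by group
    c_grad = (g * V).sum(axis=1)               # chain rule: dL/dc_g = <g_g, v_g>
    V_grad = c[:, None] * g                    # chain rule: dL/dv_g = c_g * g_g
    c -= lr * c_grad
    V -= lr * V_grad

w = (c[:, None] * V).reshape(G, k)
print("per-group norms:", np.round(np.linalg.norm(w, axis=1), 4))
```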
arXiv Detail & Related papers (2023-01-29T20:54:03Z)
- Memory-Efficient Backpropagation through Large Linear Layers [107.20037639738433]
In modern neural networks like Transformers, linear layers require significant memory to store activations during the backward pass.
This study proposes a memory reduction approach to perform backpropagation through linear layers.
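A sketch of one such memory reduction, assuming a simple row-subsampling compressor (the paper's estimator is built on randomized matrix multiplication and differs in detail): the weight gradient X^T dY is approximated from a stored subsample of the input activations, while the input gradient needs only the weights and remains exact.

```python
import numpy as np

rng = np.random.default_rng(4)
B, din, dout = 256, 512, 512
X = rng.standard_normal((B, din))            # input activations
W = rng.standard_normal((din, dout)) / np.sqrt(din)

Y = X @ W                                    # forward pass
p = 0.25
idx = rng.choice(B, size=int(B * p), replace=False)
X_saved = X[idx]                             # keep 25% of the rows for backward

dY = rng.standard_normal(Y.shape)            # upstream gradient (stand-in)
dW_exact = X.T @ dY
dW_approx = X_saved.T @ dY[idx] / p          # unbiased estimate of X^T dY
dX = dY @ W.T                                # input gradient: exact, needs only W

print("relative error of weight gradient:",
      np.linalg.norm(dW_approx - dW_exact) / np.linalg.norm(dW_exact))
```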
arXiv Detail & Related papers (2022-01-31T13:02:41Z)
- Error-Correcting Neural Networks for Two-Dimensional Curvature Computation in the Level-Set Method [0.0]
We present an error-neural-modeling-based strategy for approximating two-dimensional curvature in the level-set method.
Our main contribution is a redesigned hybrid solver that relies on numerical schemes to enable machine-learning operations on demand.
arXiv Detail & Related papers (2022-01-22T05:14:40Z)
- Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z)
- LQF: Linear Quadratic Fine-Tuning [114.3840147070712]
We present the first method for linearizing a pre-trained model that achieves comparable performance to non-linear fine-tuning.
LQF consists of simple modifications to the architecture, loss function and optimization typically used for classification.
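A minimal sketch of the linearization step on a toy two-parameter model (LQF itself operates on a deep pre-trained network and, per the summary, also modifies the architecture and loss): expanding f to first order around the pre-trained weights turns fine-tuning under a quadratic loss into plain linear least squares.

```python
import numpy as np

rng = np.random.default_rng(5)

w0 = np.array([1.0, 0.8])       # "pre-trained" weights of f(x) = a * tanh(b * x)

def f(x, w):
    return w[0] * np.tanh(w[1] * x)

def jac(x, w):
    """Rows J_i = [df/da, df/db] at x_i: [tanh(b x), a x (1 - tanh^2(b x))]."""
    t = np.tanh(w[1] * x)
    return np.stack([t, w[0] * x * (1 - t ** 2)], axis=1)

x = rng.uniform(-2, 2, 100)                                  # fine-tuning inputs
y = 1.3 * np.tanh(1.1 * x) + 0.05 * rng.standard_normal(100) # fine-tuning targets

# Linearize around w0: f_lin(x; w) = f(x; w0) + J(x; w0) (w - w0).
# With a quadratic loss, fine-tuning the linearized model is least squares.
J = jac(x, w0)
dw, *_ = np.linalg.lstsq(J, y - f(x, w0), rcond=None)
print("linearized fine-tuned weights:", w0 + dw)
```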
arXiv Detail & Related papers (2020-12-21T06:40:20Z)
- Understanding Implicit Regularization in Over-Parameterized Single Index Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z)
- Provably Efficient Neural Estimation of Structural Equation Model: An Adversarial Approach [144.21892195917758]
We study estimation in a class of generalized structural equation models (SEMs).
We formulate the linear operator equation as a min-max game, where both players are parameterized by neural networks (NNs), and learn the parameters of these neural networks using gradient descent-ascent.
For the first time we provide a tractable estimation procedure for SEMs based on NNs with provable convergence and without the need for sample splitting.
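A stand-in illustration with linear players instead of neural networks, on a made-up instrumental-variable-style data model: a moment condition E[(y - theta x) z] = 0 is written as a min-max game and solved by simultaneous gradient descent-ascent.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 5000
z = rng.standard_normal(N)        # instrument
u = rng.standard_normal(N)        # unobserved confounder
x = z + u
y = 2.0 * x + u                   # structural parameter theta* = 2

# Moment condition E[(y - theta*x) z] = 0 written as a min-max game:
#   min_theta max_a  a * E[(y - theta*x) z] - 0.5 * a^2 * E[z^2]
theta, a = 0.0, 0.0
eta = 0.05
for _ in range(2000):
    r = y - theta * x
    theta -= eta * (-a * np.mean(x * z))              # descent for the estimator
    a += eta * (np.mean(r * z) - a * np.mean(z * z))  # ascent for the critic

print("estimated theta:", theta)  # approaches 2; plain least squares is biased here
```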
arXiv Detail & Related papers (2020-07-02T17:55:47Z)
- To Each Optimizer a Norm, To Each Norm its Generalization [31.682969645989512]
We study the implicit regularization of optimization methods for linear models interpolating the training data in the under-parametrized and over-parametrized regimes.
We argue that analyzing convergence to the standard maximum l2-margin solution is arbitrary and show that minimizing the norm induced by the data results in better generalization.
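A quick numerical check of the optimizer-norm correspondence, with an arbitrary fixed diagonal preconditioner P (a sketch, not the paper's experiments): preconditioned gradient descent from zero converges to the interpolator of minimum ||w||_{P^{-1}} norm rather than to the minimum l2-norm solution.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 15, 40
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
P = np.diag(rng.uniform(0.2, 5.0, d))  # fixed preconditioner defines the geometry

w = np.zeros(d)
for _ in range(50000):                 # preconditioned GD on the squared loss
    w -= 1e-3 * P @ (X.T @ (X @ w - y))

# Interpolator of minimum ||w||_{P^{-1}} norm: w* = P X^T (X P X^T)^{-1} y.
w_star = P @ X.T @ np.linalg.solve(X @ P @ X.T, y)
w_l2 = X.T @ np.linalg.solve(X @ X.T, y)   # minimum l2-norm interpolator
print("gap to P-norm solution:", np.linalg.norm(w - w_star))
print("gap to l2-norm solution:", np.linalg.norm(w - w_l2))
```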
arXiv Detail & Related papers (2020-06-11T21:07:38Z)