Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization
- URL: http://arxiv.org/abs/2108.11371v1
- Date: Wed, 25 Aug 2021 17:58:21 GMT
- Title: Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization
- Authors: Difan Zou and Yuan Cao and Yuanzhi Li and Quanquan Gu
- Abstract summary: We show that Adam and gradient descent can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization.
We also show that if the training objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
- Score: 118.50301177912381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive gradient methods such as Adam have gained increasing popularity in
deep learning optimization. However, it has been observed that compared with
(stochastic) gradient descent, Adam can converge to a different solution with a
significantly worse test error in many deep learning applications such as image
classification, even with fine-tuned regularization. In this paper, we
provide a theoretical explanation for this phenomenon: we show that in the
nonconvex setting of learning over-parameterized two-layer convolutional neural
networks starting from the same random initialization, for a class of data
distributions (inspired by image data), Adam and gradient descent (GD) can
converge to different global solutions of the training objective with provably
different generalization errors, even with weight decay regularization. In
contrast, we show that if the training objective is convex and weight decay
regularization is employed, any optimization algorithm, including Adam and GD,
will converge to the same solution if the training is successful. This
suggests that the inferior generalization performance of Adam is fundamentally
tied to the nonconvex landscape of deep learning optimization.
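As a minimal numerical illustration of this contrast (not the paper's two-layer CNN construction), the sketch below runs full-batch GD and Adam on the same L2-regularized nonconvex toy objective from the same initialization. The objective and hyperparameters are our own assumptions; depending on the landscape and step sizes, the two trajectories can settle at different minimizers.

```python
import numpy as np

# Toy L2-regularized nonconvex objective (NOT the paper's two-layer CNN setup):
# f(w) = (w0^2 - 1)^2 + 5*(w1 - w0^2)^2 + (LAM/2)*||w||^2
LAM = 0.01

def grad(w):
    w0, w1 = w
    g0 = 4 * w0 * (w0**2 - 1) - 20 * w0 * (w1 - w0**2) + LAM * w0
    g1 = 10 * (w1 - w0**2) + LAM * w1
    return np.array([g0, g1])

def run_gd(w_init, lr=1e-2, steps=5000):
    w = w_init.copy()
    for _ in range(steps):
        w -= lr * grad(w)               # weight decay enters through the L2 term in grad
    return w

def run_adam(w_init, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    w = w_init.copy()
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)      # bias-corrected first moment
        v_hat = v / (1 - beta2**t)      # bias-corrected second moment
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

init = np.array([0.2, -0.5])            # same initialization for both optimizers
print("GD   converged to:", run_gd(init))
print("Adam converged to:", run_adam(init))
```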
Related papers
- Conjugate-Gradient-like Based Adaptive Moment Estimation Optimization Algorithm for Deep Learning [2.695991050833627]
We propose a new optimization algorithm named CG-like-Adam for deep learning.
Specifically, both the first-order and second-order moment estimates of generic Adam are replaced by conjugate-gradient-like quantities.
Numerical experiments on the CIFAR-10/100 datasets show the superiority of the proposed algorithm.
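A rough sketch of one plausible reading of this idea, assuming a Fletcher-Reeves conjugate-gradient coefficient in place of Adam's first moment and an exponential average of the squared direction in place of the second moment; the authors' exact update rule may differ.

```python
import numpy as np

def cg_like_adam_step(w, grad_fn, state, lr=1e-2, beta2=0.999, eps=1e-8):
    # One illustrative update: Adam's first moment is replaced by a
    # conjugate-gradient-like direction (Fletcher-Reeves coefficient here,
    # which is our assumption, not necessarily the paper's exact choice).
    g = grad_fn(w)
    g_prev, d_prev, v, t = state["g"], state["d"], state["v"], state["t"] + 1
    if d_prev is None:
        d = -g                                        # first step: steepest descent
    else:
        beta_fr = (g @ g) / (g_prev @ g_prev + eps)   # Fletcher-Reeves coefficient
        d = -g + beta_fr * d_prev                     # conjugate-gradient-like direction
    v = beta2 * v + (1 - beta2) * d**2                # "second moment" of the direction
    v_hat = v / (1 - beta2**t)
    w = w + lr * d / (np.sqrt(v_hat) + eps)           # note: d already points downhill
    return w, {"g": g, "d": d, "v": v, "t": t}

# Usage on a small quadratic, where the exact minimizer is known.
A = np.array([[3.0, 0.2], [0.2, 1.0]])
b = np.array([1.0, -2.0])
w = np.zeros(2)
state = {"g": None, "d": None, "v": np.zeros(2), "t": 0}
for _ in range(3000):
    w, state = cg_like_adam_step(w, lambda x: A @ x - b, state)
print("iterate:", w, " exact:", np.linalg.solve(A, b))
```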
arXiv Detail & Related papers (2024-04-02T07:57:17Z) - AdamL: A fast adaptive gradient method incorporating loss function [1.6025685183216696]
We propose AdamL, a novel variant of Adam that takes loss function information into account to attain better results.
We show that AdamL achieves either the fastest convergence or the lowest objective function values when compared to Adam, EAdam, and AdaBelief.
In the case of vanilla convolutional neural networks, AdamL stands out from the other Adam variants and does not require manual adjustment of the learning rate during the later stages of training.
arXiv Detail & Related papers (2023-12-23T16:32:29Z) - Theoretical Characterization of the Generalization Performance of
Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
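For readers unfamiliar with benign overfitting, here is a single-task numpy sketch with a spiked Gaussian design (our simplification; the paper studies a meta-learning setting): the minimum-norm interpolator fits noisy training data exactly, yet its test error stays near the noise level.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, sigma = 50, 2000, 5, 0.5        # samples, features (d >> n), signal dims, noise (our choices)

# Spiked design: the first k features are "strong" and carry all of the signal.
def sample(m):
    X = rng.normal(size=(m, d))
    X[:, :k] *= 5.0                      # strong directions
    return X

w_star = np.zeros(d)
w_star[:k] = rng.normal(size=k) / np.sqrt(k)

X = sample(n)
y = X @ w_star + sigma * rng.normal(size=n)

# Minimum-norm interpolator: the solution gradient descent reaches from zero initialization.
w_hat = np.linalg.pinv(X) @ y

X_test = sample(2000)
y_test = X_test @ w_star + sigma * rng.normal(size=2000)

print("train MSE  :", np.mean((X @ w_hat - y) ** 2))            # ~0: perfect interpolation
print("test MSE   :", np.mean((X_test @ w_hat - y_test) ** 2))  # near the noise level sigma^2
print("trivial MSE:", np.mean(y_test ** 2))                     # always predicting 0, for scale
```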
arXiv Detail & Related papers (2023-04-09T20:36:13Z) - Theoretical Characterization of How Neural Network Pruning Affects its
Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
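A toy sketch of the kind of experiment this result describes (architecture, pruning scheme, and hyperparameters are our own choices, not the paper's setting): a randomly pruned, wide one-hidden-layer ReLU network is trained with gradient descent, and the final training loss is reported for several pruning fractions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 20, 10, 500                     # few samples, wide hidden layer (our choices)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def train_pruned(prune_frac, steps=5000, lr=0.1):
    W = rng.normal(size=(m, d)) / np.sqrt(d)
    a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed output layer
    mask = rng.random(size=(m, d)) > prune_frac       # random pruning mask, kept fixed
    for _ in range(steps):
        H = np.maximum(X @ (W * mask).T, 0.0)         # ReLU features, shape (n, m)
        err = H @ a - y                               # residuals, shape (n,)
        dH = np.outer(err, a) * (H > 0)               # backprop through the ReLU
        W -= lr * ((dH.T @ X) / n) * mask             # gradient step on unpruned weights only
    H = np.maximum(X @ (W * mask).T, 0.0)
    return 0.5 * np.mean((H @ a - y) ** 2)

for frac in [0.0, 0.5, 0.9, 0.99]:
    print(f"pruning fraction {frac:.2f} -> final training loss {train_pruned(frac):.3e}")
```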
arXiv Detail & Related papers (2023-01-01T03:10:45Z) - Convergence rates for gradient descent in the training of
overparameterized artificial neural networks with biases [3.198144010381572]
In recent years, artificial neural networks have developed into a powerful tool for dealing with a multitude of problems for which classical solution approaches reach their limits.
However, it is still unclear why randomly initialized gradient descent optimization algorithms perform so well in practice.
arXiv Detail & Related papers (2021-02-23T18:17:47Z) - Linear Regression with Distributed Learning: A Generalization Error
Perspective [0.0]
We investigate the performance of distributed learning for large-scale linear regression.
We focus on the generalization error, i.e., the performance on unseen data.
Our results show that the generalization error of the distributed solution can be substantially higher than that of the centralized solution.
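The gap can be reproduced in a few lines of numpy (the averaging scheme below is one simple choice of ours, not necessarily the scheme analyzed in the paper): when each worker's local sample size falls below the dimension, averaging local least-squares solutions gives a markedly higher test error than the centralized fit.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_total, k, sigma = 50, 200, 10, 0.5   # dimension, total samples, workers, noise (our choices)
w_star = rng.normal(size=d)

X = rng.normal(size=(n_total, d))
y = X @ w_star + sigma * rng.normal(size=n_total)
X_test = rng.normal(size=(5000, d))
y_test = X_test @ w_star + sigma * rng.normal(size=5000)

def test_mse(w):
    return np.mean((X_test @ w - y_test) ** 2)

# Centralized: ordinary least squares on all data.
w_central, *_ = np.linalg.lstsq(X, y, rcond=None)

# Distributed (simple averaging of local least-squares solutions; each worker
# has n_total/k = 20 < d = 50 samples, so lstsq returns its minimum-norm fit).
locals_ = [np.linalg.lstsq(X[i::k], y[i::k], rcond=None)[0] for i in range(k)]
w_avg = np.mean(locals_, axis=0)

print("centralized test MSE:", test_mse(w_central))
print("distributed test MSE:", test_mse(w_avg))
print("noise floor (sigma^2):", sigma ** 2)
```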
arXiv Detail & Related papers (2021-01-22T08:43:28Z) - Towards Practical Adam: Non-Convexity, Convergence Theory, and
Mini-Batch Acceleration [12.744658958445024]
Adam is one of the most influential adaptive algorithms for training deep neural networks.
Existing approaches, such as decreasing the adaptive learning rate or adopting a large batch size, have been used to promote the convergence of Adam-type algorithms.
We introduce an alternative, easy-to-check sufficient condition that depends only on the parameters of the historical base learning rates.
arXiv Detail & Related papers (2021-01-14T06:42:29Z) - Fast Learning of Graph Neural Networks with Guaranteed Generalizability:
One-hidden-layer Case [93.37576644429578]
Graph neural networks (GNNs) have made great progress recently on learning from graph-structured data in practice.
We provide a theoretically-grounded generalizability analysis of GNNs with one hidden layer for both regression and binary classification problems.
arXiv Detail & Related papers (2020-06-25T00:45:52Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
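A coarse sketch of this idea under our own reading: per coordinate, the smoothing coefficient of the squared-gradient average is picked from a small grid so as to maximize the estimated gradient variance. The grid search and hyperparameters are assumptions; the paper's actual rule may be a closed-form choice.

```python
import numpy as np

def maxva_like_update(w, g, state, lr=5e-3, beta1=0.9, eps=1e-8,
                      beta2_grid=(0.5, 0.9, 0.99, 0.999)):
    # Adam-like step where, per coordinate, the smoothing coefficient of the
    # squared-gradient average is chosen from a small grid to maximize the
    # estimated gradient variance (our coarse reading, not the paper's rule).
    m, mu, s = state["m"], state["mu"], state["s"]
    m = beta1 * m + (1 - beta1) * g                    # usual first moment

    best_var = np.full_like(g, -np.inf)
    best_mu, best_s = mu.copy(), s.copy()
    for b2 in beta2_grid:
        mu_c = b2 * mu + (1 - b2) * g                  # candidate mean of gradients
        s_c = b2 * s + (1 - b2) * g**2                 # candidate mean of squared gradients
        var_c = s_c - mu_c**2                          # estimated variance per coordinate
        better = var_c > best_var
        best_var = np.where(better, var_c, best_var)
        best_mu = np.where(better, mu_c, best_mu)
        best_s = np.where(better, s_c, best_s)

    w = w - lr * m / (np.sqrt(best_s) + eps)
    return w, {"m": m, "mu": best_mu, "s": best_s}

# Usage on a toy quadratic with noisy gradients.
rng = np.random.default_rng(3)
w = np.array([5.0, -3.0])
state = {"m": np.zeros(2), "mu": np.zeros(2), "s": np.zeros(2)}
for _ in range(5000):
    g = 2 * w + 0.1 * rng.normal(size=2)               # gradient of ||w||^2 plus noise
    w, state = maxva_like_update(w, g, state)
print("final w:", w)                                   # drifts toward the origin
```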
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study the distributed stochastic AUC maximization problem for large-scale data, with a deep neural network as the predictive model.
Our method requires far fewer communication rounds than naive parallel approaches, both in practice and in theory.
Our experiments on several benchmark datasets demonstrate the effectiveness of our method and also confirm our theory.
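For context, AUC maximization replaces accuracy with a pairwise ranking objective. The numpy sketch below computes a common squared pairwise surrogate and the exact AUC for a fixed linear scorer on synthetic imbalanced data; it only illustrates the objective, not the paper's distributed deep-learning algorithm.

```python
import numpy as np

def pairwise_auc_surrogate(scores, labels, margin=1.0):
    # Squared pairwise surrogate for 1 - AUC: penalizes every positive/negative
    # pair whose score gap falls short of the margin.
    gaps = scores[labels == 1][:, None] - scores[labels == 0][None, :]
    return np.mean(np.maximum(0.0, margin - gaps) ** 2)

def auc(scores, labels):
    # Exact AUC: fraction of positive/negative pairs ranked correctly.
    gaps = scores[labels == 1][:, None] - scores[labels == 0][None, :]
    return np.mean((gaps > 0) + 0.5 * (gaps == 0))

rng = np.random.default_rng(4)
n, d = 400, 5
labels = (rng.random(n) < 0.2).astype(int)         # imbalanced data, where AUC is the metric of interest
X = rng.normal(size=(n, d)) + labels[:, None]      # positives shifted by +1 in every coordinate
scores = X @ np.ones(d)                            # a simple fixed linear scorer

print("surrogate loss:", pairwise_auc_surrogate(scores, labels))
print("AUC           :", auc(scores, labels))
```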
arXiv Detail & Related papers (2020-05-05T18:08:23Z)