Adam-family Methods with Decoupled Weight Decay in Deep Learning
- URL: http://arxiv.org/abs/2310.08858v1
- Date: Fri, 13 Oct 2023 04:59:44 GMT
- Title: Adam-family Methods with Decoupled Weight Decay in Deep Learning
- Authors: Kuangyu Ding, Nachuan Xiao, Kim-Chuan Toh
- Abstract summary: We investigate the convergence properties of a wide class of Adam-family methods for training nonsmooth neural networks.
We propose a novel Adam-family method named Adam with Decoupled Weight Decay (AdamD) within our proposed framework.
- Score: 3.4376560669160394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate the convergence properties of a wide class of
Adam-family methods for minimizing quadratically regularized nonsmooth
nonconvex optimization problems, especially in the context of training
nonsmooth neural networks with weight decay. Motivated by the AdamW method, we
propose a novel framework for Adam-family methods with decoupled weight decay.
Within our framework, the estimators for the first-order and second-order
moments of stochastic subgradients are updated independently of the weight
decay term. Under mild assumptions and with non-diminishing stepsizes for
updating the primary optimization variables, we establish the convergence
properties of our proposed framework. In addition, we show that our proposed
framework encompasses a wide variety of well-known Adam-family methods, hence
offering convergence guarantees for these methods in the training of nonsmooth
neural networks. More importantly, we show that our proposed framework
asymptotically approximates the SGD method, thereby providing an explanation
for the empirical observation that decoupled weight decay enhances
generalization performance for Adam-family methods. As a practical application
of our proposed framework, we propose a novel Adam-family method named Adam
with Decoupled Weight Decay (AdamD), and establish its convergence properties
under mild conditions. Numerical experiments demonstrate that AdamD outperforms
Adam and is comparable to AdamW in terms of both generalization performance and efficiency.
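The central mechanism of the decoupled framework is that the first- and second-order moment estimators only ever see the stochastic (sub)gradient, while the weight decay acts directly on the parameters. The sketch below illustrates this with a generic AdamW-style step in NumPy; the hyperparameter names and defaults (lr, beta1, beta2, eps, wd) are illustrative choices, and the exact AdamD recursion and stepsize conditions are those given in the paper, not reproduced here.

```python
import numpy as np

def decoupled_wd_step(w, grad, m, v, t, lr=1e-3, beta1=0.9,
                      beta2=0.999, eps=1e-8, wd=1e-2):
    """One AdamW-style step with decoupled weight decay (illustrative sketch).

    The moment estimators m and v are built from the raw stochastic
    (sub)gradient only; the weight decay term never enters them. A coupled
    (plain Adam + L2) variant would instead feed `grad + wd * w` into the
    moment updates below.
    """
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimator
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimator
    m_hat = m / (1 - beta1 ** t)                # bias correction, step t >= 1
    v_hat = v / (1 - beta2 ** t)

    # Weight decay is applied directly to the parameters, outside the
    # coordinate-wise adaptive rescaling by sqrt(v_hat).
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v
```

Because the decay term bypasses the coordinate-wise rescaling, every parameter is shrunk at the same rate regardless of its gradient history, which is in line with the paper's observation that the decoupled framework asymptotically approximates the SGD method.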
Related papers
- A Comprehensive Framework for Analyzing the Convergence of Adam: Bridging the Gap with SGD [28.905886549938305]
We introduce a novel and comprehensive framework for analyzing the convergence properties of Adam.
We show that Adam attains non-asymptotic sample complexity bounds similar to those of gradient descent.
arXiv Detail & Related papers (2024-10-06T12:15:00Z) - An Optimization-based Deep Equilibrium Model for Hyperspectral Image
Deconvolution with Convergence Guarantees [71.57324258813675]
We propose a novel methodology for addressing the hyperspectral image deconvolution problem.
A new optimization problem is formulated, leveraging a learnable regularizer in the form of a neural network.
The derived iterative solver is then expressed as a fixed-point calculation problem within the Deep Equilibrium framework.
arXiv Detail & Related papers (2023-06-10T08:25:16Z) - Adam-family Methods for Nonsmooth Optimization with Convergence
Guarantees [5.69991777684143]
We introduce a novel framework that adopts a two-timescale updating scheme, and prove its convergence properties under mild assumptions.
Our proposed framework encompasses various popular Adam-family methods, providing convergence guarantees for these methods in training nonsmooth neural networks.
We develop subgradient methods that incorporate clipping techniques for training nonsmooth neural networks with heavy-tailed noise (a minimal clipping sketch appears after this list).
arXiv Detail & Related papers (2023-05-06T05:35:56Z) - A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is so simple and generic that it can be leveraged to establish the convergence for solving a broader family of non-convex compositional optimization problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Sampling-free Variational Inference for Neural Networks with
Multiplicative Activation Noise [51.080620762639434]
We propose a more efficient parameterization of the posterior approximation for sampling-free variational inference.
Our approach yields competitive results for standard regression problems and scales well to large-scale image classification tasks.
arXiv Detail & Related papers (2021-03-15T16:16:18Z) - Towards Practical Adam: Non-Convexity, Convergence Theory, and
Mini-Batch Acceleration [12.744658958445024]
Adam is one of the most influential adaptive algorithms for training deep neural networks.
Existing approaches, such as decreasing the adaptive learning rate or adopting a large batch size, have tried to promote the convergence of Adam-type algorithms.
We introduce an alternative easy-to-check sufficient condition, which depends only on the parameters of the historical base learning rates.
arXiv Detail & Related papers (2021-01-14T06:42:29Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$+$ (pronounced as Adam-plus)
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - On the Trend-corrected Variant of Adaptive Stochastic Optimization
Methods [30.084554989542475]
We present a new framework for Adam-type methods that incorporates trend information when updating the parameters with adaptive step sizes and gradients.
We show empirically the importance of adding the trend component, where our framework consistently outperforms the conventional Adam and AMSGrad methods on classical models with several real-world datasets.
arXiv Detail & Related papers (2020-01-17T01:23:23Z)
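The "Adam-family Methods for Nonsmooth Optimization with Convergence Guarantees" entry above mentions subgradient methods that incorporate clipping techniques for heavy-tailed noise. The sketch below shows one generic global-norm clipping step in NumPy; the threshold `tau` and the specific clipping rule are illustrative assumptions, not the recursion analyzed in that paper.

```python
import numpy as np

def clipped_subgradient_step(w, subgrad, lr=1e-2, tau=1.0):
    """One stochastic subgradient step with global-norm clipping (sketch).

    Rescaling the subgradient whenever its norm exceeds `tau` caps the
    influence of any single heavy-tailed noise sample on the update.
    """
    norm = np.linalg.norm(subgrad)
    if norm > tau:
        subgrad = subgrad * (tau / norm)    # shrink back to norm tau
    return w - lr * subgrad
```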
This list is automatically generated from the titles and abstracts of the papers in this site.