On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation Performance
- URL: http://arxiv.org/abs/2302.01029v3
- Date: Fri, 12 Jul 2024 09:46:14 GMT
- Title: On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation Performance
- Authors: Guoqiang Zhang,
- Abstract summary: We exploit layerwise gradient statistics to suppress the range of the adaptive stepsizes of Adam.
The resulting algorithm is referred to as SET-Adam, where SET is shorthand for the three operations.
SET-Adam produces higher validation accuracies than Adam and AdaBelief for training ResNet18 over ImageNet.
- Score: 2.71467552808655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A number of recent adaptive optimizers improve the generalisation performance of Adam by essentially reducing the variance of its adaptive stepsizes to get closer to SGD with momentum. Following this motivation, we suppress the range of the adaptive stepsizes of Adam by exploiting the layerwise gradient statistics. In particular, at each iteration we propose to perform three consecutive operations on the second momentum v_t before using it to update a DNN model: (1) down-scaling, (2) epsilon-embedding, and (3) down-translating. The resulting algorithm is referred to as SET-Adam, where SET is shorthand for the three operations. The down-scaling operation on v_t is performed layerwise by making use of the angles between the layerwise subvectors of v_t and the corresponding all-one subvectors. Extensive experimental results show that SET-Adam outperforms eight adaptive optimizers when training transformers and LSTMs for NLP, and VGG and ResNet for image classification over CIFAR10 and CIFAR100, while matching the best performance of the eight adaptive methods when training WGAN-GP models for image generation tasks. Furthermore, SET-Adam produces higher validation accuracies than Adam and AdaBelief for training ResNet18 over ImageNet.
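To make the three operations more concrete, here is a minimal, illustrative sketch of what a layerwise SET-style transform of the second momentum v_t could look like. The abstract does not give the exact formulas, so the function name `set_style_transform`, the cosine-based scaling factor, the way epsilon is embedded, and the `translate_frac` parameter are all assumptions made purely for illustration, not the paper's definitions.

```python
import numpy as np

def set_style_transform(v_layers, eps=1e-8, translate_frac=0.5):
    """Illustrative SET-style transform of Adam's second momentum v_t.

    The three operations follow the abstract's description only loosely:
    the cosine-based scaling, the epsilon-embedding rule, and the
    translation fraction are hypothetical choices, not the paper's.
    """
    transformed = []
    for v in v_layers:  # one second-momentum subvector per network layer
        ones = np.ones_like(v)
        # (1) down-scaling: multiply the layer's subvector by the cosine of the
        #     angle between v and the all-one vector of the same dimension
        #     (the abstract only says these angles are used; the exact rule is assumed).
        cos = float(v @ ones) / (np.linalg.norm(v) * np.linalg.norm(ones) + eps)
        v_scaled = cos * v
        # (2) epsilon-embedding: fold a small constant directly into the scaled
        #     momentum (one possible reading of "epsilon-embedding").
        v_embedded = v_scaled + eps
        # (3) down-translating: shift every entry downward by a fraction of the
        #     smallest entry (again a hypothetical rule; the paper's may differ).
        v_translated = v_embedded - translate_frac * v_embedded.min()
        transformed.append(v_translated)
    return transformed

# The transformed subvectors would then stand in for v_t in an Adam-style
# update, e.g. step = lr * m_hat / np.sqrt(v_new) applied per layer.
```

In a real optimizer these subvectors would correspond to the per-layer slices of v_t, and the transform would be applied at every iteration before the parameter update, as described in the abstract.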
Related papers
- Deconstructing What Makes a Good Optimizer for Language Models [7.9224468703944115]
We compare several optimization algorithms, including SGD, Adafactor, Adam, and Lion, in the context of autoregressive language modeling.
Our findings indicate that, except for SGD, these algorithms all perform comparably in their optimal performance.
arXiv Detail & Related papers (2024-07-10T18:11:40Z) - Variational Stochastic Gradient Descent for Deep Neural Networks [16.96187187108041]
Current state-of-the-art optimizers are adaptive gradient-based methods such as Adam.
Here, we propose to combine both approaches, resulting in Variational Stochastic Gradient Descent (VSGD).
We show how our VSGD method relates to other adaptive gradient-based methods such as Adam.
arXiv Detail & Related papers (2024-04-09T18:02:01Z) - MADA: Meta-Adaptive Optimizers through hyper-gradient Descent [73.1383658672682]
We introduce Meta-Adaptive Optimizers (MADA), a unified framework that can generalize several known optimizers and dynamically learn the most suitable one during training.
We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and the other baselines.
We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization (see the sketch after this list).
arXiv Detail & Related papers (2024-01-17T00:16:46Z) - Bidirectional Looking with A Novel Double Exponential Moving Average to
Adaptive and Non-adaptive Momentum Optimizers [109.52244418498974]
We propose a novel Admeta (A Double exponential Moving averagE Adaptive and non-adaptive momentum) framework.
We provide two implementations, AdmetaR and AdmetaS, the former based on RAdam and the latter based on SGDM.
arXiv Detail & Related papers (2023-07-02T18:16:06Z) - Prompt Tuning for Parameter-efficient Medical Image Segmentation [79.09285179181225]
We propose and investigate several contributions to achieve a parameter-efficient but effective adaptation for semantic segmentation on two medical imaging datasets.
We pre-train this architecture with a dedicated dense self-supervision scheme based on assignments to online generated prototypes.
We demonstrate that the resulting neural network model is able to attenuate the gap between fully fine-tuned and parameter-efficiently adapted models.
arXiv Detail & Related papers (2022-11-16T21:55:05Z) - AdaNorm: Adaptive Gradient Norm Correction based Optimizer for CNNs [23.523389372182613]
Stochastic gradient descent (SGD) optimizers are generally used to train convolutional neural networks (CNNs).
Existing SGD variants do not exploit the gradient norms of past iterations, which leads to poor convergence and performance.
We propose novel AdaNorm-based SGD optimizers that correct the gradient norm in each iteration based on an adaptive history of past gradient norms.
arXiv Detail & Related papers (2022-10-12T16:17:25Z) - A Control Theoretic Framework for Adaptive Gradient Optimizers in
Machine Learning [0.6526824510982802]
Adaptive gradient methods have become popular in optimizing deep neural networks.
Recent examples include AdaGrad and Adam.
We develop a generic framework for adaptive gradient methods.
arXiv Detail & Related papers (2022-06-04T17:55:33Z) - When Vision Transformers Outperform ResNets without Pretraining or
Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers through the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and smoothness when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning [91.13797346047984]
We introduce ADAHESSIAN, a second order optimization algorithm which dynamically incorporates the curvature of the loss function via adaptive estimates of the Hessian.
We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods.
arXiv Detail & Related papers (2020-06-01T05:00:51Z)
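The MADA entry above mentions only one concrete mechanism: AVGrad replaces the elementwise maximum that AMSGrad applies to the second moment with an average. The sketch below (referenced from that entry) contrasts the two; the decay rate, the simple running average, and the function name are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def second_moment_reference(grads, beta2=0.999, use_avgrad=False):
    """Contrast AMSGrad's running elementwise max of the second moment with an
    averaged variant in the spirit of AVGrad (illustrative only; the actual
    MADA/AVGrad update rules may differ)."""
    v = np.zeros_like(grads[0])
    v_ref = np.zeros_like(grads[0])  # running max (AMSGrad) or running average (AVGrad-style)
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g ** 2   # Adam's exponential moving average of squared gradients
        if use_avgrad:
            v_ref = ((t - 1) * v_ref + v) / t    # replace the max with a running average over iterations
        else:
            v_ref = np.maximum(v_ref, v)         # AMSGrad keeps the historical elementwise max
    return v_ref  # used in place of v when forming the adaptive stepsize
```

Toggling `use_avgrad` switches between the two reference statistics; the rest of an Adam-style update would be unchanged.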
This list is automatically generated from the titles and abstracts of the papers on this site.