Training Deep Neural Networks with Adaptive Momentum Inspired by the
Quadratic Optimization
- URL: http://arxiv.org/abs/2110.09057v1
- Date: Mon, 18 Oct 2021 07:03:48 GMT
- Authors: Tao Sun, Huaming Ling, Zuoqiang Shi, Dongsheng Li, Bao Wang
- Abstract summary: We propose a new adaptive momentum inspired by the optimal choice of the heavy ball momentum for quadratic optimization.
Our proposed adaptive heavy ball momentum can improve stochastic gradient descent (SGD) and Adam.
We verify the efficiency of SGD and Adam with the new adaptive momentum on extensive machine learning benchmarks, including image classification, language modeling, and machine translation.
- Score: 20.782428252187024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Heavy ball momentum is crucial in accelerating (stochastic) gradient-based
optimization algorithms for machine learning. Existing heavy ball momentum is
usually weighted by a uniform hyperparameter, which relies on excessive tuning.
Moreover, the calibrated fixed hyperparameter may not lead to optimal
performance. In this paper, to eliminate the effort for tuning the
momentum-related hyperparameter, we propose a new adaptive momentum inspired by
the optimal choice of the heavy ball momentum for quadratic optimization. Our
proposed adaptive heavy ball momentum can improve stochastic gradient descent
(SGD) and Adam. SGD and Adam with the newly designed adaptive momentum are more
robust to large learning rates, converge faster, and generalize better than the
baselines. We verify the efficiency of SGD and Adam with the new adaptive
momentum on extensive machine learning benchmarks, including image
classification, language modeling, and machine translation. Finally, we provide
convergence guarantees for SGD and Adam with the proposed adaptive momentum.
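
To make the reference to "the optimal choice of the heavy ball momentum for quadratic optimization" concrete, here is a minimal sketch of the classical heavy ball method on a diagonal quadratic, using the well-known optimal fixed step size and momentum for a quadratic with smallest curvature mu and largest curvature L. This is not the adaptive scheme proposed in the paper; the function names (heavy_ball_quadratic, polyak_parameters, quadratic_loss) and the toy eigenvalues are assumptions chosen purely for illustration.

```python
import math


def quadratic_loss(eigs, x):
    """f(x) = 0.5 * sum_i h_i * x_i**2 for a diagonal quadratic (toy problem)."""
    return 0.5 * sum(h * xi * xi for h, xi in zip(eigs, x))


def heavy_ball_quadratic(eigs, x0, steps, lr, beta):
    """Heavy ball iteration: x_next = x - lr * grad f(x) + beta * (x - x_prev)."""
    x_prev = list(x0)
    x = list(x0)
    for _ in range(steps):
        grad = [h * xi for h, xi in zip(eigs, x)]
        x_next = [xi - lr * g + beta * (xi - xp)
                  for xi, g, xp in zip(x, grad, x_prev)]
        x_prev, x = x, x_next
    return x


def polyak_parameters(mu, L):
    """Classical optimal fixed step size and momentum for a quadratic
    whose curvatures lie in [mu, L]."""
    s_mu, s_L = math.sqrt(mu), math.sqrt(L)
    lr = 4.0 / (s_L + s_mu) ** 2
    beta = ((s_L - s_mu) / (s_L + s_mu)) ** 2
    return lr, beta


# Hypothetical toy problem with condition number 100.
eigs = [1.0, 10.0, 100.0]
x0 = [1.0, 1.0, 1.0]
mu, L = min(eigs), max(eigs)

lr_opt, beta_opt = polyak_parameters(mu, L)
x_hb = heavy_ball_quadratic(eigs, x0, 100, lr_opt, beta_opt)

# Baseline: plain gradient descent (beta = 0) with its own optimal fixed step.
x_gd = heavy_ball_quadratic(eigs, x0, 100, 2.0 / (mu + L), 0.0)

loss_hb = quadratic_loss(eigs, x_hb)
loss_gd = quadratic_loss(eigs, x_gd)
print("heavy ball loss:", loss_hb)
print("gradient descent loss:", loss_gd)
```

The optimal values depend on mu and L, which are generally unknown for a neural network loss; that dependence is one reason a fixed momentum hyperparameter typically requires tuning in practice.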
Related papers
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to training large models.
We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z) - Optimization Hyper-parameter Laws for Large Language Models [56.322914260197734]
We present Opt-Laws, a framework that captures the relationship between hyper-parameters and training outcomes.
Our validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss.
This approach significantly reduces computational costs while enhancing overall model performance.
arXiv Detail & Related papers (2024-09-07T09:37:19Z) - Adaptive Friction in Deep Learning: Enhancing Optimizers with Sigmoid and Tanh Function [0.0]
We introduce sigSignGrad and tanhSignGrad, two novel optimizers that integrate adaptive friction coefficients.
Our theoretical analysis demonstrates the wide-ranging adjustment capability of the friction coefficient S.
Experiments on CIFAR-10 and Mini-ImageNet using ResNet50 and ViT architectures confirm the superior performance of our proposed optimizers.
arXiv Detail & Related papers (2024-08-07T03:20:46Z) - Memory-Efficient Optimization with Factorized Hamiltonian Descent [11.01832755213396]
We introduce a novel adaptive optimizer, H-Fac, which incorporates a memory-efficient factorization approach to address this challenge.
By employing a rank-1 parameterization for both momentum and scaling parameter estimators, H-Fac reduces memory costs to a sublinear level.
We develop our algorithms based on principles derived from Hamiltonian dynamics, providing robust theoretical underpinnings in optimization dynamics and convergence guarantees.
arXiv Detail & Related papers (2024-06-14T12:05:17Z) - MADA: Meta-Adaptive Optimizers through hyper-gradient Descent [73.1383658672682]
We introduce Meta-Adaptive Optimizers (MADA), a unified framework that can generalize several known optimizers and dynamically learn the most suitable one during training.
We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers.
We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization.
arXiv Detail & Related papers (2024-01-17T00:16:46Z) - Flatter, faster: scaling momentum for optimal speedup of SGD [0.0]
We study training dynamics arising from the interplay between stochastic gradient descent (SGD), label noise, and momentum in the training of neural networks.
We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization.
arXiv Detail & Related papers (2022-10-28T20:41:48Z) - Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style
Adaptive Momentum [9.843647947055745]
In deep learning practice, momentum is usually weighted by a well-calibrated constant.
We propose a novel adaptive momentum for improving DNN training.
arXiv Detail & Related papers (2020-12-03T18:59:43Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - Multi-level Training and Bayesian Optimization for Economical
Hyperparameter Optimization [12.92634461859467]
In this paper, we develop an effective approach to reducing the total amount of required training time for Hyperparameter Optimization.
We propose a truncated additive Gaussian process model to calibrate approximate performance measurements generated by light training.
Based on the model, a sequential model-based algorithm is developed to generate the performance profile of the configuration space as well as find optimal ones.
arXiv Detail & Related papers (2020-07-20T09:03:02Z) - Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate
and Momentum [97.84312669132716]
We disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and flat minima selection.
Our experiments demonstrate that the proposed adaptive inertia method can generalize significantly better than SGD and conventional adaptive gradient methods.
arXiv Detail & Related papers (2020-06-29T05:21:02Z) - Improved Adversarial Training via Learned Optimizer [101.38877975769198]
We propose a framework to improve the robustness of adversarial training models.
By co-training the optimizer's parameters and the model's weights, the proposed framework consistently improves robustness and provides update directions and steps adaptively.
arXiv Detail & Related papers (2020-04-25T20:15:53Z)