Training Deep Neural Networks with Adaptive Momentum Inspired by the
Quadratic Optimization
- URL: http://arxiv.org/abs/2110.09057v1
- Date: Mon, 18 Oct 2021 07:03:48 GMT
- Title: Training Deep Neural Networks with Adaptive Momentum Inspired by the
Quadratic Optimization
- Authors: Tao Sun, Huaming Ling, Zuoqiang Shi, Dongsheng Li, Bao Wang
- Abstract summary: We propose a new adaptive momentum inspired by the optimal choice of the heavy ball momentum for optimization.
Our proposed adaptive heavy ball momentum can improve gradient descent (SGD) and Adam.
We verify the efficiency of SGD and Adam with the new adaptive momentum on extensive machine learning benchmarks, including image classification, language modeling, and machine translation.
- Score: 20.782428252187024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Heavy ball momentum is crucial in accelerating (stochastic) gradient-based
optimization algorithms for machine learning. Existing heavy ball momentum is
usually weighted by a uniform hyperparameter, which relies on excessive tuning.
Moreover, the calibrated fixed hyperparameter may not lead to optimal
performance. In this paper, to eliminate the effort for tuning the
momentum-related hyperparameter, we propose a new adaptive momentum inspired by
the optimal choice of the heavy ball momentum for quadratic optimization. Our
proposed adaptive heavy ball momentum can improve stochastic gradient descent
(SGD) and Adam. SGD and Adam with the newly designed adaptive momentum are more
robust to large learning rates, converge faster, and generalize better than the
baselines. We verify the efficiency of SGD and Adam with the new adaptive
momentum on extensive machine learning benchmarks, including image
classification, language modeling, and machine translation. Finally, we provide
convergence guarantees for SGD and Adam with the proposed adaptive momentum.
Related papers
- MADA: Meta-Adaptive Optimizers through hyper-gradient Descent [73.1383658672682]
We introduce Meta-Adaptives (MADA), a unified framework that can generalize several known convergences and dynamically learn the most suitable one during training.
We empirically compare MADA to other populars on vision and language tasks, and find that MADA consistently outperforms Adam and other populars.
We also propose AVGrad, a modification of AMS that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization.
arXiv Detail & Related papers (2024-01-17T00:16:46Z) - Flatter, faster: scaling momentum for optimal speedup of SGD [0.0]
We study training dynamics arising from interplay between gradient descent (SGD) and label noise and momentum in the training of neural networks.
We find that scaling the momentum hyper parameter $1-NISTbeta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization.
arXiv Detail & Related papers (2022-10-28T20:41:48Z) - Improving Multi-fidelity Optimization with a Recurring Learning Rate for
Hyperparameter Tuning [7.591442522626255]
We propose Multi-fidelity Optimization with a Recurring Learning rate (MORL)
MORL incorporates CNNs' optimization process into multi-fidelity optimization.
It alleviates the problem of slow-starter and achieves a more precise low-fidelity approximation.
arXiv Detail & Related papers (2022-09-26T08:16:31Z) - Bayesian Optimization and Deep Learning forsteering wheel angle
prediction [58.720142291102135]
This work aims to obtain an accurate model for the prediction of the steering angle in an automated driving system.
BO was able to identify, within a limited number of trials, a model -- namely BOST-LSTM -- which resulted, the most accurate when compared to classical end-to-end driving models.
arXiv Detail & Related papers (2021-10-22T15:25:14Z) - On the Importance of Hyperparameter Optimization for Model-based
Reinforcement Learning [27.36718899899319]
Model-based Reinforcement Learning (MBRL) is a promising framework for learning control in a data-efficient manner.
MBRL typically requires significant human expertise before it can be applied to new problems and domains.
arXiv Detail & Related papers (2021-02-26T18:57:47Z) - Online hyperparameter optimization by real-time recurrent learning [57.01871583756586]
Our framework takes advantage of the analogy between hyperparameter optimization and parameter learning in neural networks (RNNs)
It adapts a well-studied family of online learning algorithms for RNNs to tune hyperparameters and network parameters simultaneously.
This procedure yields systematically better generalization performance compared to standard methods, at a fraction of wallclock time.
arXiv Detail & Related papers (2021-02-15T19:36:18Z) - Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style
Adaptive Momentum [9.843647947055745]
In deep learning practice, momentum is usually weighted by a well-calibrated constant.
We propose a novel emphadaptive momentum for improving DNNs training.
arXiv Detail & Related papers (2020-12-03T18:59:43Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem)
AdaRem adjusts the parameter-wise learning rate according to whether the direction of one parameter changes in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - Multi-level Training and Bayesian Optimization for Economical
Hyperparameter Optimization [12.92634461859467]
In this paper, we develop an effective approach to reducing the total amount of required training time for Hyperparameter Optimization.
We propose a truncated additive Gaussian process model to calibrate approximate performance measurements generated by light training.
Based on the model, a sequential model-based algorithm is developed to generate the performance profile of the configuration space as well as find optimal ones.
arXiv Detail & Related papers (2020-07-20T09:03:02Z) - Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate
and Momentum [97.84312669132716]
We disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and flat minima selection.
Our experiments demonstrate that the proposed adaptive inertia method can generalize significantly better than SGD and conventional adaptive gradient methods.
arXiv Detail & Related papers (2020-06-29T05:21:02Z) - Improved Adversarial Training via Learned Optimizer [101.38877975769198]
We propose a framework to improve the robustness of adversarial training models.
By co-training's parameters model's weights, the proposed framework consistently improves robustness and steps adaptively for update directions.
arXiv Detail & Related papers (2020-04-25T20:15:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.