Weight Prediction Boosts the Convergence of AdamW
- URL: http://arxiv.org/abs/2302.00195v2
- Date: Tue, 8 Aug 2023 02:06:23 GMT
- Title: Weight Prediction Boosts the Convergence of AdamW
- Authors: Lei Guan
- Abstract summary: We introduce weight prediction into the AdamW optimizer to boost its convergence when training deep neural network (DNN) models.
In particular, ahead of each mini-batch training step, we predict the future weights according to the update rule of AdamW and then apply the predicted future weights to both the forward pass and backward propagation.
- Score: 3.7485728774744556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce weight prediction into the AdamW optimizer
to boost its convergence when training deep neural network (DNN) models. In
particular, ahead of each mini-batch training step, we predict the future weights
according to the update rule of AdamW and then use the predicted future weights
for both the forward pass and backward propagation. In this way, the AdamW
optimizer always uses the gradients w.r.t. the future weights, rather than the
current weights, to update the DNN parameters, which yields better convergence.
Our proposal is simple and straightforward to implement yet effective in boosting
the convergence of DNN training. We performed extensive experimental evaluations
on image classification and language modeling tasks to verify its effectiveness.
The results validate that our proposal boosts the convergence of AdamW and
achieves better accuracy than AdamW when training DNN models.
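The idea is easy to sketch in code. Below is a minimal PyTorch-style sketch, assuming AdamW's cached first and second moments stand in for the not-yet-known next gradient when predicting the future weights; the helper names, the one-step lookahead, and the training-loop wiring are illustrative assumptions based on the abstract, not the authors' reference implementation.

```python
import copy
import torch

def predict_future_weights(model, optimizer):
    """Return a copy of `model` whose weights are stepped one AdamW update
    ahead, using the optimizer's cached moment estimates as a proxy for the
    not-yet-known next gradient (an assumption of this sketch)."""
    lookahead = copy.deepcopy(model)
    name_of = {p: n for n, p in model.named_parameters()}
    ahead = dict(lookahead.named_parameters())
    with torch.no_grad():
        for group in optimizer.param_groups:
            lr, wd, eps = group["lr"], group["weight_decay"], group["eps"]
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                state = optimizer.state.get(p, {})
                if "exp_avg" not in state:   # no history yet: keep current weights
                    continue
                step = state["step"]
                m_hat = state["exp_avg"] / (1 - beta1 ** step)     # bias-corrected 1st moment
                v_hat = state["exp_avg_sq"] / (1 - beta2 ** step)  # bias-corrected 2nd moment
                w = ahead[name_of[p]]
                w.mul_(1 - lr * wd)                                # decoupled weight decay
                w.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)    # AdamW-style step
    return lookahead

def train_step(model, optimizer, loss_fn, x, y):
    # 1) predict the future weights from AdamW's update rule
    lookahead = predict_future_weights(model, optimizer)
    # 2) forward pass and backward propagation at the predicted weights
    loss = loss_fn(lookahead(x), y)
    loss.backward()
    # 3) update the *current* weights with gradients w.r.t. the future weights
    for p, q in zip(model.parameters(), lookahead.parameters()):
        p.grad = q.grad
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The deep copy keeps the sketch short; an implementation that predicts and restores the weights in place would avoid the extra memory and the parameter-matching bookkeeping.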
Related papers
- Variational Learning is Effective for Large Deep Networks [76.94351631300788]
We show that Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks.
IVON's computational cost is nearly identical to Adam's, but its predictive uncertainty is better.
We find overwhelming evidence that variational learning is effective.
arXiv Detail & Related papers (2024-02-27T16:11:05Z)
- Switch EMA: A Free Lunch for Better Flatness and Sharpness [58.55452862747021]
This work unveils the full potential of EMA with a single-line modification, i.e., switching the EMA parameters back into the original model after each epoch, dubbed Switch EMA (SEMA).
From both theoretical and empirical perspectives, we demonstrate that SEMA helps DNNs reach generalization optima that better trade off flatness and sharpness.
arXiv Detail & Related papers (2024-02-14T15:28:42Z)
- XGrad: Boosting Gradient-Based Optimizers With Weight Prediction [20.068681423455057]
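The switch described above is simple to write down. Below is a minimal sketch, assuming a PyTorch-style model and a placeholder EMA decay of 0.999; it is an illustrative reading of the summary above, not the SEMA authors' implementation.

```python
import copy
import torch

class SwitchEMA:
    """Keep an exponential moving average (EMA) of the weights and copy it
    back into the live model at the end of each epoch."""

    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model)          # holds the EMA weights
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # standard EMA update, called after every optimizer step
        for e, p in zip(self.shadow.parameters(), model.parameters()):
            e.mul_(self.decay).add_(p, alpha=1 - self.decay)

    @torch.no_grad()
    def switch(self, model):
        # the single-line modification: load the EMA weights into the model
        # at the end of each epoch, so training resumes from the averaged point
        for e, p in zip(self.shadow.parameters(), model.parameters()):
            p.copy_(e)
```

In use, `update` runs once per training step and `switch` once per epoch.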
In this paper, we propose XGrad, a general deep learning training framework.
XGrad introduces weight prediction into popular gradient-based optimizers to boost their convergence and generalization when training DNNs.
The experimental results validate that XGrad attains higher model accuracy than the baseline optimizers when training the models.
arXiv Detail & Related papers (2023-05-26T10:34:00Z)
- Boosted Dynamic Neural Networks [53.559833501288146]
A typical early-exiting dynamic neural network (EDNN) has multiple prediction heads at different layers of the network backbone.
To optimize the model, these prediction heads, together with the network backbone, are trained on every batch of training data.
Treating training and testing inputs differently in the two phases causes a mismatch between the training and testing data distributions.
We formulate an EDNN as an additive model inspired by gradient boosting, and propose multiple training techniques to optimize the model effectively.
arXiv Detail & Related papers (2022-11-30T04:23:12Z)
- Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale [16.97880876259831]
Amos is a gradient-based optimizer for training deep neural networks.
It can be viewed as a variant of Adam with theoretically supported, adaptive learning-rate decay and weight decay.
arXiv Detail & Related papers (2022-10-21T02:37:58Z)
- How Do Adam and Training Strategies Help BNNs Optimization? [50.22482900678071]
We show that Adam is better equipped to handle the rugged loss surface of binary neural networks (BNNs) and reaches a better optimum with higher generalization ability.
We derive a simple training scheme, building on existing Adam-based optimization, which achieves 70.5% top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-06-21T17:59:51Z)
- Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning [60.20150317299749]
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning.
To make full use of the training data, we propose a full data learning method for speech enhancement.
arXiv Detail & Related papers (2020-11-11T06:32:37Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that additionally introducing momentum results in a rapid reduction of the effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
- Train-by-Reconnect: Decoupling Locations of Weights from their Values [6.09170287691728]
We show that untrained deep neural networks (DNNs) are different from trained ones.
We propose a novel method named Lookahead Permutation (LaPerm) to train DNNs by reconnecting the weights.
When the initial weights share a single value, our method finds a weight-agnostic neural network with far better-than-chance accuracy.
arXiv Detail & Related papers (2020-03-05T12:40:46Z)