Weight Prediction Boosts the Convergence of AdamW
- URL: http://arxiv.org/abs/2302.00195v2
- Date: Tue, 8 Aug 2023 02:06:23 GMT
- Title: Weight Prediction Boosts the Convergence of AdamW
- Authors: Lei Guan
- Abstract summary: We introduce weight prediction into the AdamW optimizer to boost its convergence when training deep neural network (DNN) models.
In particular, ahead of each mini-batch training step, we predict the future weights according to the update rule of AdamW and then apply the predicted future weights to both the forward pass and backward propagation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce weight prediction into the AdamW optimizer to
boost its convergence when training deep neural network (DNN) models. In
particular, ahead of each mini-batch training step, we predict the future weights
according to the update rule of AdamW and then use the predicted future
weights for both the forward pass and backward propagation. In this way, the
AdamW optimizer always utilizes the gradients w.r.t. the future weights, instead
of the current weights, to update the DNN parameters, which yields better
convergence. Our proposal is simple and straightforward to implement yet
effective in boosting the convergence of DNN training. We performed extensive
experimental evaluations on image classification and language modeling tasks to
verify its effectiveness. The experimental results validate that our proposal
boosts the convergence of AdamW and achieves better accuracy than AdamW when
training DNN models.
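The look-ahead step described in the abstract can be sketched in a few lines. The following scalar toy is a minimal sketch of the idea only, not the paper's implementation: the function name `adamw_predict_step`, the one-step prediction horizon, and all hyper-parameter values are illustrative assumptions.

```python
import math

def adamw_predict_step(w, m, v, grad_fn, t, lr=1e-3, b1=0.9, b2=0.999,
                       eps=1e-8, wd=1e-2):
    """One AdamW step with weight prediction (scalar sketch, one-step look-ahead)."""
    # 1) Predict the future weight using AdamW's own update rule,
    #    reusing the cached first/second moments from previous steps.
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w_pred = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)
    # 2) Evaluate the gradient at the *predicted* weight, not the current one
    #    (in a real DNN this is the forward pass + backward propagation).
    g = grad_fn(w_pred)
    # 3) Standard AdamW update, driven by the look-ahead gradient.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

# Toy usage: minimize f(w) = w^2 starting from w = 5.
w, m, v = 5.0, 0.0, 0.0
grad_fn = lambda x: 2.0 * x  # gradient of f(w) = w^2
for t in range(1, 501):
    w, m, v = adamw_predict_step(w, m, v, grad_fn, t, lr=0.05)
```

Because the moments already encode the direction AdamW is about to move, evaluating the gradient at the predicted weight gives the optimizer information about where it is heading rather than where it currently is.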
Related papers
- Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps [65.64965527170156]
We adapt the widely used Adam optimiser for use in reinforcement learning.
We propose Adam-Rel, which uses the local timestep within an epoch, essentially resetting Adam's timestep to 0 after target changes.
We then show that increases in gradient norm occur in RL in practice, and examine the differences between our theoretical model and the observed data.
arXiv Detail & Related papers (2024-12-22T18:01:08Z) - MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
We propose a unified training framework for deep neural networks.
We introduce three instances of MARS that leverage preconditioned gradient optimization.
Results indicate that MARS consistently outperforms Adam.
arXiv Detail & Related papers (2024-11-15T18:57:39Z) - Variational Learning is Effective for Large Deep Networks [76.94351631300788]
We show that an Improved Variational Online Newton consistently matches or outperforms Adam for training large networks.
IVON's computational costs are nearly identical to Adam but its predictive uncertainty is better.
We find overwhelming evidence that variational learning is effective.
arXiv Detail & Related papers (2024-02-27T16:11:05Z) - Switch EMA: A Free Lunch for Better Flatness and Sharpness [58.55452862747021]
This work unveils the full potential of EMA with a single line of modification, i.e., switching the EMA parameters back to the original model after each epoch, dubbed Switch EMA (SEMA).
From both theoretical and empirical aspects, we demonstrate that SEMA can help DNNs reach generalization optima that better trade off flatness and sharpness.
arXiv Detail & Related papers (2024-02-14T15:28:42Z) - XGrad: Boosting Gradient-Based Optimizers With Weight Prediction [20.068681423455057]
In this paper, we propose a general deep learning training framework XGrad.
XGrad introduces weight prediction into popular gradient-based optimizers to boost the convergence and generalization of DNN training.
The experimental results validate that XGrad can attain higher model accuracy than the baselines when training the models.
arXiv Detail & Related papers (2023-05-26T10:34:00Z) - Boosted Dynamic Neural Networks [53.559833501288146]
A typical early-exiting dynamic neural network (EDNN) has multiple prediction heads at different layers of the network backbone.
To optimize the model, these prediction heads together with the network backbone are trained on every batch of training data.
Treating training and testing inputs differently at the two phases causes a mismatch between the training and testing data distributions.
We formulate an EDNN as an additive model inspired by gradient boosting, and propose multiple training techniques to optimize the model effectively.
arXiv Detail & Related papers (2022-11-30T04:23:12Z) - Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale [16.97880876259831]
Amos is a gradient-based system for training deep neural networks.
It can be viewed as an Adam with theoretically supported, adaptive learning-rate decay and weight decay.
arXiv Detail & Related papers (2022-10-21T02:37:58Z) - How Do Adam and Training Strategies Help BNNs Optimization? [50.22482900678071]
We show that Adam is better equipped to handle the rugged loss surface of BNNs and reaches a better optimum with higher generalization ability.
We derive a simple training scheme, building on existing Adam-based optimization, which achieves 70.5% top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-06-21T17:59:51Z) - Train-by-Reconnect: Decoupling Locations of Weights from their Values [6.09170287691728]
We show that untrained deep neural networks (DNNs) are different from trained ones.
We propose a novel method named Lookahead Permutation (LaPerm) to train DNNs by reconnecting the weights.
When the initial weights share a single value, our method finds a weight-agnostic neural network with far better-than-chance accuracy.
arXiv Detail & Related papers (2020-03-05T12:40:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.