Weight Prediction Boosts the Convergence of AdamW
- URL: http://arxiv.org/abs/2302.00195v2
- Date: Tue, 8 Aug 2023 02:06:23 GMT
- Title: Weight Prediction Boosts the Convergence of AdamW
- Authors: Lei Guan
- Abstract summary: We introduce weight prediction into the AdamW optimizer to boost its convergence when training deep neural network (DNN) models.
In particular, ahead of each mini-batch training step, we predict the future weights according to the update rule of AdamW and then apply the predicted future weights to both the forward pass and backward propagation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce weight prediction into the AdamW optimizer to
boost its convergence when training deep neural network (DNN) models. In
particular, ahead of each mini-batch training step, we predict the future weights
according to the update rule of AdamW and then use the predicted future
weights for both the forward pass and backward propagation. In this way, the
AdamW optimizer always utilizes the gradients w.r.t. the future weights, instead
of the current weights, to update the DNN parameters, which yields better
convergence. Our proposal is simple and straightforward to implement yet
effective in boosting the convergence of DNN training. We performed extensive
experimental evaluations on image classification and language modeling tasks to
verify its effectiveness. The experimental results validate that our proposal
boosts the convergence of AdamW and achieves better accuracy than AdamW when
training DNN models.
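The look-ahead step described in the abstract can be sketched in a few lines. The following scalar toy is a minimal sketch of the idea only, not the paper's implementation: the function name `adamw_predict_step`, the one-step prediction horizon, and all hyper-parameter values are illustrative assumptions.

```python
import math

def adamw_predict_step(w, m, v, grad_fn, t, lr=1e-3, b1=0.9, b2=0.999,
                       eps=1e-8, wd=1e-2):
    """One AdamW step with weight prediction (scalar sketch, one-step look-ahead)."""
    # 1) Predict the future weight using AdamW's own update rule,
    #    reusing the cached first/second moments from previous steps.
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w_pred = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)
    # 2) Evaluate the gradient at the *predicted* weight, not the current one
    #    (in a real DNN this is the forward pass + backward propagation).
    g = grad_fn(w_pred)
    # 3) Standard AdamW update, driven by the look-ahead gradient.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

# Toy usage: minimize f(w) = w^2 starting from w = 5.
w, m, v = 5.0, 0.0, 0.0
grad_fn = lambda x: 2.0 * x  # gradient of f(w) = w^2
for t in range(1, 501):
    w, m, v = adamw_predict_step(w, m, v, grad_fn, t, lr=0.05)
```

Because the moments already encode the direction AdamW is about to move, evaluating the gradient at the predicted weight gives the optimizer information about where it is heading rather than where it currently is.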
Related papers
- Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps [65.64965527170156]
We adapt the widely used Adam optimiser for use in reinforcement learning.
We propose Adam-Rel, which uses the local timestep within an epoch, essentially resetting Adam's timestep to 0 after target changes.
We then show that increases in gradient norm occur in RL in practice, and examine the differences between our theoretical model and the observed data.
arXiv Detail & Related papers (2024-12-22T18:01:08Z) - MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
We propose a unified training framework for deep neural networks.
We introduce three instances of MARS that leverage preconditioned gradient optimization.
Results indicate that MARS consistently outperforms Adam.
arXiv Detail & Related papers (2024-11-15T18:57:39Z) - Variational Learning is Effective for Large Deep Networks [76.94351631300788]
We show that an Improved Variational Online Newton consistently matches or outperforms Adam for training large networks.
IVON's computational costs are nearly identical to Adam but its predictive uncertainty is better.
We find overwhelming evidence that variational learning is effective.
arXiv Detail & Related papers (2024-02-27T16:11:05Z) - Switch EMA: A Free Lunch for Better Flatness and Sharpness [58.55452862747021]
This work unveils the full potential of EMA with a single line of modification, i.e., switching the EMA parameters back to the original model after each epoch, dubbed Switch EMA (SEMA).
From both theoretical and empirical aspects, we demonstrate that SEMA can help DNNs reach generalization optima that better trade off flatness and sharpness.
arXiv Detail & Related papers (2024-02-14T15:28:42Z) - XGrad: Boosting Gradient-Based Optimizers With Weight Prediction [20.068681423455057]
In this paper, we propose a general deep learning training framework XGrad.
XGrad introduces weight prediction into popular gradient-based optimizers to boost the convergence and generalization of DNN training.
The experimental results validate that XGrad can attain higher model accuracy than the baselines when training the models.
arXiv Detail & Related papers (2023-05-26T10:34:00Z) - Boosted Dynamic Neural Networks [53.559833501288146]
A typical early-exiting dynamic neural network (EDNN) has multiple prediction heads at different layers of the network backbone.
To optimize the model, these prediction heads together with the network backbone are trained on every batch of training data.
Treating training and testing inputs differently at the two phases causes a mismatch between the training and testing data distributions.
We formulate an EDNN as an additive model inspired by gradient boosting, and propose multiple training techniques to optimize the model effectively.
arXiv Detail & Related papers (2022-11-30T04:23:12Z) - Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale [16.97880876259831]
Amos is a gradient-based system for training deep neural networks.
It can be viewed as an Adam with theoretically supported, adaptive learning-rate decay and weight decay.
arXiv Detail & Related papers (2022-10-21T02:37:58Z) - How Do Adam and Training Strategies Help BNNs Optimization? [50.22482900678071]
We show that Adam is better equipped to handle the rugged loss surface of BNNs and reaches a better optimum with higher generalization ability.
We derive a simple training scheme, building on existing Adam-based optimization, which achieves 70.5% top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-06-21T17:59:51Z) - Train-by-Reconnect: Decoupling Locations of Weights from their Values [6.09170287691728]
We show that untrained deep neural networks (DNNs) are different from trained ones.
We propose a novel method named Lookahead Permutation (LaPerm) to train DNNs by reconnecting the weights.
When the initial weights share a single value, our method finds a weight-agnostic neural network with far better-than-chance accuracy.
arXiv Detail & Related papers (2020-03-05T12:40:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.