Related papers: XGrad: Boosting Gradient-Based Optimizers With Weight Prediction

XGrad: Boosting Gradient-Based Optimizers With Weight Prediction

URL: http://arxiv.org/abs/2305.18240v2
Date: Sun, 7 Apr 2024 16:07:53 GMT
Title: XGrad: Boosting Gradient-Based Optimizers With Weight Prediction
Authors: Lei Guan, Dongsheng Li, Yanqi Shi, Jian Meng,
Abstract summary: In this paper, we propose a general deep learning training framework XGrad. XGrad introduces weight prediction into the popular gradient-based DNNs to boost their convergence and generalization. The experimental results validate that XGrad can attain higher model accuracy than the baselines when training the models.
Score: 20.068681423455057
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we propose a general deep learning training framework XGrad which introduces weight prediction into the popular gradient-based optimizers to boost their convergence and generalization when training the deep neural network (DNN) models. In particular, ahead of each mini-batch training, the future weights are predicted according to the update rule of the used optimizer and are then applied to both the forward pass and backward propagation. In this way, during the whole training period, the optimizer always utilizes the gradients w.r.t. the future weights to update the DNN parameters, making the gradient-based optimizer achieve better convergence and generalization compared to the original optimizer without weight prediction. XGrad is rather straightforward to implement yet pretty effective in boosting the convergence of gradient-based optimizers and the accuracy of DNN models. Empirical results concerning five popular optimizers including SGD with momentum, Adam, AdamW, AdaBelief, and AdaM3 demonstrate the effectiveness of our proposal. The experimental results validate that XGrad can attain higher model accuracy than the baseline optimizers when training the DNN models. The code of XGrad will be available at: https://github.com/guanleics/XGrad.

Related papers

Grams: Gradient Descent with Adaptive Momentum Scaling [19.966519464887575]
We introduce $mathbfG$radient Descent with $mathbfA$daptive $mathbfM$omentum $mathbfS$caling ($mathbfGrams) Grams is an optimization algorithm that decouples the direction and magnitude of updates in deep learning.
arXiv Detail & Related papers (2024-12-22T17:39:32Z)
MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Large gradient algorithms like Adam, Adam, and their variants have been central to the development of this type of training. We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
AdaFisher: Adaptive Second Order Optimization via Fisher Information [22.851200800265914]
We present AdaFisher, an adaptive second-order that leverages a block-diagonal approximation to the Fisher information matrix for adaptive preconditioning gradient. We demonstrate that AdaFisher outperforms the SOTAs in terms of both accuracy and convergence speed.
arXiv Detail & Related papers (2024-05-26T01:25:02Z)
Weight Prediction Boosts the Convergence of AdamW [3.7485728774744556]
We introduce weight prediction into the AdamW to boost its convergence when training the deep neural network (DNN) models. In particular, ahead of each mini-batch training, we predict the future weights according to the update rule of AdamW and then apply the predicted future weights.
arXiv Detail & Related papers (2023-02-01T02:58:29Z)
Boosted Dynamic Neural Networks [53.559833501288146]
A typical EDNN has multiple prediction heads at different layers of the network backbone. To optimize the model, these prediction heads together with the network backbone are trained on every batch of training data. Treating training and testing inputs differently at the two phases will cause the mismatch between training and testing data distributions. We formulate an EDNN as an additive model inspired by gradient boosting, and propose multiple training techniques to optimize the model effectively.
arXiv Detail & Related papers (2022-11-30T04:23:12Z)
VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatiles. We train an ingest for deep learning which is itself a small neural network that ingests and outputs parameter updates. We open source our learned, meta-training code, the associated train test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z)
AdaNorm: Adaptive Gradient Norm Correction based Optimizer for CNNs [23.523389372182613]
gradient descent (SGD)s are generally used to train the convolutional neural networks (CNNs) Existing SGDs do not exploit the gradient norm of past iterations and lead to poor convergence and performance. We propose a novel AdaNorm based SGDs by correcting the norm of gradient in each iteration based on the adaptive training history of gradient norm.
arXiv Detail & Related papers (2022-10-12T16:17:25Z)
Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models [134.83964935755964]
In deep learning, different kinds of deep networks typically need different extrapolations, which have to be chosen after multiple trials. To relieve this issue and consistently improve the model training speed deep networks, we propose the ADAtive Nesterov momentum Transformer.
arXiv Detail & Related papers (2022-08-13T16:04:39Z)
DEBOSH: Deep Bayesian Shape Optimization [48.80431740983095]
We propose a novel uncertainty-based method tailored to shape optimization. It enables effective BO and increases the quality of the resulting shapes beyond that of state-of-the-art approaches.
arXiv Detail & Related papers (2021-09-28T11:01:42Z)
Tom: Leveraging trend of the observed gradients for faster convergence [0.0]
Tom is a novel variant of Adam that takes into account the trend observed for the gradients in the landscape in the loss traversed by the neural network. Tom outperforms Adagrad, Adadelta, RMSProp and Adam in terms of both accuracy and has a faster convergence.
arXiv Detail & Related papers (2021-09-07T20:19:40Z)
How Do Adam and Training Strategies Help BNNs Optimization? [50.22482900678071]
We show that Adam is better equipped to handle the rugged loss surface of BNNs and reaches a better optimum with higher generalization ability. We derive a simple training scheme, building on existing Adam-based optimization, which achieves 70.5% top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-06-21T17:59:51Z)
A Bop and Beyond: A Second Order Optimizer for Binarized Neural Networks [0.0]
optimization of Binary Neural Networks (BNNs) relies on approximating the real-valued weights with their binarized representations. In this paper, we take an approach parallel to Adam which also uses the second raw moment estimate to normalize the first raw moment before doing the comparison with the threshold. We present two versions of the proposed: a biased one and a bias-corrected one, each with its own applications.
arXiv Detail & Related papers (2021-04-11T22:20:09Z)
Self-Directed Online Machine Learning for Topology Optimization [58.920693413667216]
Self-directed Online Learning Optimization integrates Deep Neural Network (DNN) with Finite Element Method (FEM) calculations. Our algorithm was tested by four types of problems including compliance minimization, fluid-structure optimization, heat transfer enhancement and truss optimization. It reduced the computational time by 2 5 orders of magnitude compared with directly using methods, and outperformed all state-of-the-art algorithms tested in our experiments.
arXiv Detail & Related papers (2020-02-04T20:00:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.