Careful with that Scalpel: Improving Gradient Surgery with an EMA
- URL: http://arxiv.org/abs/2402.02998v1
- Date: Mon, 5 Feb 2024 13:37:00 GMT
- Title: Careful with that Scalpel: Improving Gradient Surgery with an EMA
- Authors: Yu-Guan Hsieh, James Thornton, Eugene Ndiaye, Michal Klein, Marco Cuturi, Pierre Ablin
- Abstract summary: We show how one can improve performance by blending the gradients beyond a simple sum.
We demonstrate that our method, Bloop, can lead to much better performances on NLP and vision experiments.
- Score: 32.73961859864032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Beyond minimizing a single training loss, many deep learning estimation
pipelines rely on an auxiliary objective to quantify and encourage desirable
properties of the model (e.g. performance on another dataset, robustness,
agreement with a prior). Although the simplest approach to incorporating an
auxiliary loss is to sum it with the training loss as a regularizer, recent
works have shown that one can improve performance by blending the gradients
beyond a simple sum; this is known as gradient surgery. We cast the problem as
a constrained minimization problem where the auxiliary objective is minimized
among the set of minimizers of the training loss. To solve this bilevel
problem, we follow a parameter update direction that combines the training loss
gradient and the orthogonal projection of the auxiliary gradient to the
training gradient. In a setting where gradients come from mini-batches, we
explain how, using a moving average of the training loss gradients, we can
carefully maintain this critical orthogonality property. We demonstrate that
our method, Bloop, can lead to much better performances on NLP and vision
experiments than other gradient surgery methods without EMA.
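The update direction described in the abstract can be sketched in a few lines: maintain an exponential moving average (EMA) of the training-loss gradients, project the auxiliary gradient orthogonally to that EMA, and add the result to the training gradient. The function and hyperparameter names below (`beta`, `lam`) are illustrative, not the paper's exact ones.

```python
import numpy as np

def bloop_direction(g_train, g_aux, ema, beta=0.9, lam=0.1):
    """One step of the abstract's update rule (illustrative sketch):
    training gradient plus the auxiliary gradient projected
    orthogonally to an EMA of the training-loss gradients."""
    ema = beta * ema + (1.0 - beta) * g_train  # moving average of train grads
    denom = ema @ ema
    if denom > 1e-12:
        # remove the component of g_aux along the EMA direction
        g_aux_orth = g_aux - (g_aux @ ema) / denom * ema
    else:
        g_aux_orth = g_aux
    return g_train + lam * g_aux_orth, ema
```

Using the mini-batch EMA rather than a single noisy training gradient is what preserves the orthogonality property on average: the auxiliary component added to the update stays orthogonal to the smoothed training direction.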
Related papers
- An Effective Dynamic Gradient Calibration Method for Continual Learning [11.555822066922508]
Continual learning (CL) is a fundamental topic in machine learning, where the goal is to train a model with continuously incoming data and tasks.
Due to the memory limit, we cannot store all the historical data, and therefore confront the "catastrophic forgetting" problem.
We develop an effective algorithm to calibrate the gradient in each updating step of the model.
arXiv Detail & Related papers (2024-07-30T16:30:09Z) - Continual Learning with Scaled Gradient Projection [8.847574864259391]
In neural networks, continual learning results in gradient interference among sequential tasks, leading to forgetting of old tasks while learning new ones.
We propose a Scaled Gradient Projection (SGP) method to improve new learning while minimizing forgetting.
We conduct experiments ranging from continual image classification to reinforcement learning tasks and report better performance with less training overhead than the state-of-the-art approaches.
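The common pattern behind this family of continual-learning methods can be sketched as follows: components of the new-task gradient that lie in a subspace important to old tasks are scaled down rather than followed fully. The basis construction and scaling rule here are illustrative, not SGP's exact formulation.

```python
import numpy as np

def project_gradient(g_new, basis, scale=0.0):
    """Illustrative gradient projection for continual learning.
    `basis` has orthonormal columns spanning directions important
    to old tasks; only a fraction `scale` of the in-span component
    of the new-task gradient is kept."""
    coeff = basis.T @ g_new                 # coordinates in the old-task subspace
    in_span = basis @ coeff                 # component that would disturb old tasks
    return g_new - (1.0 - scale) * in_span  # keep `scale` of that component
```

With `scale=0` this reduces to full orthogonal projection (no interference with the protected subspace); intermediate values trade off stability against plasticity.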
arXiv Detail & Related papers (2023-02-02T19:46:39Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
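For context, the basic forward-gradient estimator (which the paper improves upon by perturbing activations with local losses) samples a random direction, takes the directional derivative, and scales the direction by it. The sketch below uses an analytic gradient for the directional derivative for clarity; in practice it would come from a forward-mode JVP, with no backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_gradient(grad_f, w, n_samples=1):
    """Unbiased forward-gradient estimate of grad_f(w):
    E[(grad . v) v] = grad for v ~ N(0, I). Variance grows with
    dimension, which motivates the paper's activation perturbations."""
    g_hat = np.zeros_like(w)
    for _ in range(n_samples):
        v = rng.standard_normal(w.shape)
        g_hat += (grad_f(w) @ v) * v  # directional derivative times direction
    return g_hat / n_samples
```

Averaging over samples shows the estimator is unbiased but noisy, which is exactly the variance issue the paper's local-loss construction targets.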
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Gradient Correction beyond Gradient Descent [63.33439072360198]
Gradient correction is arguably the most crucial aspect of training a neural network.
We introduce a framework (GCGD) to perform gradient correction.
Experimental results show that our gradient correction framework can effectively improve the gradient quality, reducing training epochs by ~20% while also improving network performance.
arXiv Detail & Related papers (2022-03-16T01:42:25Z) - Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning [13.937644559223548]
How to train deep neural networks (DNNs) to generalize well is a central concern in deep learning.
We propose an effective method to improve the model generalization by penalizing the gradient norm of loss function during optimization.
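The idea can be made concrete on a toy least-squares problem, where everything is analytic: the penalized objective is f(w) + lam * ||grad f(w)||, and its gradient picks up a Hessian-vector term. The setup below is a hedged illustration; the paper applies the same idea to DNNs via a first-order approximation rather than an exact Hessian.

```python
import numpy as np

def penalized_step(w, A, b, lam=0.1, lr=0.1):
    """One gradient step on f(w) + lam * ||grad f(w)|| for the toy
    problem f(w) = 0.5 * ||A w - b||^2 (illustrative sketch)."""
    r = A @ w - b
    g = A.T @ r                  # grad f(w)
    gnorm = np.linalg.norm(g)
    H = A.T @ A                  # Hessian of f (constant for least squares)
    if gnorm > 1e-12:
        # grad of the penalty term lam * ||g|| is lam * H g / ||g||
        g_total = g + lam * (H @ g) / gnorm
    else:
        g_total = g
    return w - lr * g_total
```

The extra `H g / ||g||` term steers optimization toward flatter regions where the gradient norm is small, which is the stated mechanism for better generalization.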
arXiv Detail & Related papers (2022-02-08T02:03:45Z) - Bilevel learning of l1-regularizers with closed-form gradients (BLORC) [8.138650738423722]
We present a method for supervised learning of sparsity-promoting regularizers.
The parameters are learned to minimize the mean squared error of reconstruction on a training set of ground truth signal and measurement pairs.
arXiv Detail & Related papers (2021-11-21T17:01:29Z) - Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels [92.98756432746482]
We study a weakly supervised problem called learning with complementary labels.
We show that the quality of gradient estimation matters more than the unbiasedness of the risk estimator in risk minimization.
We propose a novel surrogate complementary loss (SCL) framework that trades zero bias with reduced variance.
arXiv Detail & Related papers (2020-07-05T04:19:37Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Regularizing Meta-Learning via Gradient Dropout [102.29924160341572]
Meta-learning models are prone to overfitting when there are not sufficient training tasks for the meta-learners to generalize.
We introduce a simple yet effective method to alleviate the risk of overfitting for gradient-based meta-learning.
arXiv Detail & Related papers (2020-04-13T10:47:02Z) - The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of the "break-even" point on this trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.