Careful with that Scalpel: Improving Gradient Surgery with an EMA
- URL: http://arxiv.org/abs/2402.02998v1
- Date: Mon, 5 Feb 2024 13:37:00 GMT
- Title: Careful with that Scalpel: Improving Gradient Surgery with an EMA
- Authors: Yu-Guan Hsieh, James Thornton, Eugene Ndiaye, Michal Klein, Marco Cuturi, Pierre Ablin
- Abstract summary: We show how one can improve performance by blending the gradients beyond a simple sum.
We demonstrate that our method, Bloop, can lead to much better performances on NLP and vision experiments.
- Score: 32.73961859864032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Beyond minimizing a single training loss, many deep learning estimation
pipelines rely on an auxiliary objective to quantify and encourage desirable
properties of the model (e.g. performance on another dataset, robustness,
agreement with a prior). Although the simplest approach to incorporating an
auxiliary loss is to sum it with the training loss as a regularizer, recent
works have shown that one can improve performance by blending the gradients
beyond a simple sum; this is known as gradient surgery. We cast the problem as
a constrained minimization problem where the auxiliary objective is minimized
among the set of minimizers of the training loss. To solve this bilevel
problem, we follow a parameter update direction that combines the training loss
gradient and the orthogonal projection of the auxiliary gradient to the
training gradient. In a setting where gradients come from mini-batches, we
explain how, using a moving average of the training loss gradients, we can
carefully maintain this critical orthogonality property. We demonstrate that
our method, Bloop, can lead to much better performances on NLP and vision
experiments than other gradient surgery methods without EMA.
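The update direction described in the abstract can be sketched in a few lines: maintain an exponential moving average (EMA) of the training-loss gradients, project the auxiliary gradient orthogonally to that EMA, and add the result to the training gradient. The function and hyperparameter names below (`beta`, `lam`) are illustrative, not the paper's exact ones.

```python
import numpy as np

def bloop_direction(g_train, g_aux, ema, beta=0.9, lam=0.1):
    """One step of the abstract's update rule (illustrative sketch):
    training gradient plus the auxiliary gradient projected
    orthogonally to an EMA of the training-loss gradients."""
    ema = beta * ema + (1.0 - beta) * g_train  # moving average of train grads
    denom = ema @ ema
    if denom > 1e-12:
        # remove the component of g_aux along the EMA direction
        g_aux_orth = g_aux - (g_aux @ ema) / denom * ema
    else:
        g_aux_orth = g_aux
    return g_train + lam * g_aux_orth, ema
```

Using the mini-batch EMA rather than a single noisy training gradient is what preserves the orthogonality property on average: the auxiliary component added to the update stays orthogonal to the smoothed training direction.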
Related papers
- An Effective Dynamic Gradient Calibration Method for Continual Learning [11.555822066922508]
Continual learning (CL) is a fundamental topic in machine learning, where the goal is to train a model with continuously incoming data and tasks.
Due to the memory limit, we cannot store all the historical data, and therefore confront the "catastrophic forgetting" problem.
We develop an effective algorithm to calibrate the gradient in each updating step of the model.
arXiv Detail & Related papers (2024-07-30T16:30:09Z) - Continual Learning with Scaled Gradient Projection [8.847574864259391]
In neural networks, continual learning results in gradient interference among sequential tasks, leading to forgetting of old tasks while learning new ones.
We propose a Scaled Gradient Projection (SGP) method to improve new learning while minimizing forgetting.
We conduct experiments ranging from continual image classification to reinforcement learning tasks and report better performance with less training overhead than the state-of-the-art approaches.
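The common pattern behind this family of continual-learning methods can be sketched as follows: components of the new-task gradient that lie in a subspace important to old tasks are scaled down rather than followed fully. The basis construction and scaling rule here are illustrative, not SGP's exact formulation.

```python
import numpy as np

def project_gradient(g_new, basis, scale=0.0):
    """Illustrative gradient projection for continual learning.
    `basis` has orthonormal columns spanning directions important
    to old tasks; only a fraction `scale` of the in-span component
    of the new-task gradient is kept."""
    coeff = basis.T @ g_new                 # coordinates in the old-task subspace
    in_span = basis @ coeff                 # component that would disturb old tasks
    return g_new - (1.0 - scale) * in_span  # keep `scale` of that component
```

With `scale=0` this reduces to full orthogonal projection (no interference with the protected subspace); intermediate values trade off stability against plasticity.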
arXiv Detail & Related papers (2023-02-02T19:46:39Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
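For context, the basic forward-gradient estimator (which the paper improves upon by perturbing activations with local losses) samples a random direction, takes the directional derivative, and scales the direction by it. The sketch below uses an analytic gradient for the directional derivative for clarity; in practice it would come from a forward-mode JVP, with no backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_gradient(grad_f, w, n_samples=1):
    """Unbiased forward-gradient estimate of grad_f(w):
    E[(grad . v) v] = grad for v ~ N(0, I). Variance grows with
    dimension, which motivates the paper's activation perturbations."""
    g_hat = np.zeros_like(w)
    for _ in range(n_samples):
        v = rng.standard_normal(w.shape)
        g_hat += (grad_f(w) @ v) * v  # directional derivative times direction
    return g_hat / n_samples
```

Averaging over samples shows the estimator is unbiased but noisy, which is exactly the variance issue the paper's local-loss construction targets.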
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Gradient Correction beyond Gradient Descent [63.33439072360198]
Gradient correction is arguably the most crucial aspect of training a neural network.
We introduce a framework (GCGD) to perform gradient correction.
Experimental results show that our gradient correction framework can effectively improve the gradient quality, reducing training epochs by ~20% while also improving network performance.
arXiv Detail & Related papers (2022-03-16T01:42:25Z) - Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning [13.937644559223548]
How to train deep neural networks (DNNs) to generalize well is a central concern in deep learning.
We propose an effective method to improve the model generalization by penalizing the gradient norm of loss function during optimization.
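The idea can be made concrete on a toy least-squares problem, where everything is analytic: the penalized objective is f(w) + lam * ||grad f(w)||, and its gradient picks up a Hessian-vector term. The setup below is a hedged illustration; the paper applies the same idea to DNNs via a first-order approximation rather than an exact Hessian.

```python
import numpy as np

def penalized_step(w, A, b, lam=0.1, lr=0.1):
    """One gradient step on f(w) + lam * ||grad f(w)|| for the toy
    problem f(w) = 0.5 * ||A w - b||^2 (illustrative sketch)."""
    r = A @ w - b
    g = A.T @ r                  # grad f(w)
    gnorm = np.linalg.norm(g)
    H = A.T @ A                  # Hessian of f (constant for least squares)
    if gnorm > 1e-12:
        # grad of the penalty term lam * ||g|| is lam * H g / ||g||
        g_total = g + lam * (H @ g) / gnorm
    else:
        g_total = g
    return w - lr * g_total
```

The extra `H g / ||g||` term steers optimization toward flatter regions where the gradient norm is small, which is the stated mechanism for better generalization.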
arXiv Detail & Related papers (2022-02-08T02:03:45Z) - Bilevel learning of l1-regularizers with closed-form gradients (BLORC) [8.138650738423722]
We present a method for supervised learning of sparsity-promoting regularizers.
The parameters are learned to minimize the mean squared error of reconstruction on a training set of ground truth signal and measurement pairs.
arXiv Detail & Related papers (2021-11-21T17:01:29Z) - Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels [92.98756432746482]
We study a weakly supervised problem called learning with complementary labels.
We show that the quality of gradient estimation matters more than the unbiasedness of the risk estimator in risk minimization.
We propose a novel surrogate complementary loss (SCL) framework that trades zero bias with reduced variance.
arXiv Detail & Related papers (2020-07-05T04:19:37Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Regularizing Meta-Learning via Gradient Dropout [102.29924160341572]
Meta-learning models are prone to overfitting when there are not sufficient training tasks for the meta-learners to generalize.
We introduce a simple yet effective method to alleviate the risk of overfitting for gradient-based meta-learning.
arXiv Detail & Related papers (2020-04-13T10:47:02Z) - The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of the "break-even" point on this trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.