Gradient Sparsification For Masked Fine-Tuning of Transformers
- URL: http://arxiv.org/abs/2307.10098v1
- Date: Wed, 19 Jul 2023 16:13:13 GMT
- Title: Gradient Sparsification For Masked Fine-Tuning of Transformers
- Authors: James O'Neill and Sourav Dutta
- Abstract summary: Fine-tuning pretrained self-supervised language models is widely adopted for transfer learning to downstream tasks.
Gradual unfreezing trades off between freezing the pretrained network and updating all of its parameters by progressively unfreezing the gradients of whole layers during training.
It is not clear whether gradually unfreezing layers throughout training is optimal compared to sparse variants of gradual unfreezing.
- Score: 6.936564049727831
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning pretrained self-supervised language models is widely adopted for
transfer learning to downstream tasks. Fine-tuning can be achieved by freezing
gradients of the pretrained network and only updating gradients of a newly
added classification layer, or by performing gradient updates on all
parameters. Gradual unfreezing makes a trade-off between the two by gradually
unfreezing gradients of whole layers during training. This has been an
effective strategy for trading off storage and training speed against
generalization performance. However, it is not clear whether gradually
unfreezing layers throughout training is optimal, compared to sparse variants
of gradual unfreezing which may improve fine-tuning performance. In this paper,
we propose to stochastically mask gradients to regularize pretrained language
models for improving overall fine-tuned performance. We introduce GradDrop and
variants thereof, a class of gradient sparsification methods that mask
gradients during the backward pass, acting as gradient noise. GradDrop is
sparse and stochastic, unlike gradual unfreezing. Extensive experiments on the
multilingual XGLUE benchmark with XLMR-Large show that GradDrop is competitive
against methods that use additional translated data for intermediate
pretraining and outperforms standard fine-tuning and gradual unfreezing. A
post-analysis shows how GradDrop improves performance on languages it was not
trained on, such as under-resourced languages.
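For context on the gradual-unfreezing baseline described above, the following is a minimal sketch with an XLM-R classifier in PyTorch/Transformers. The one-layer-per-epoch schedule, the number of labels, and the number of epochs are illustrative assumptions, not the paper's exact setup.

```python
from transformers import XLMRobertaForSequenceClassification

# Minimal gradual-unfreezing sketch: start with the pretrained encoder frozen
# and unfreeze one additional top layer each epoch, while the newly added
# classification head is always trainable.
model = XLMRobertaForSequenceClassification.from_pretrained(
    "xlm-roberta-large", num_labels=3  # num_labels is an arbitrary placeholder
)

def freeze_pretrained(model):
    """Freeze all pretrained encoder parameters; the classifier head stays trainable."""
    for param in model.roberta.parameters():
        param.requires_grad = False

def unfreeze_top_layers(model, num_unfrozen):
    """Unfreeze the top `num_unfrozen` transformer layers (highest layers first)."""
    for layer in list(model.roberta.encoder.layer)[-num_unfrozen:]:
        for param in layer.parameters():
            param.requires_grad = True

freeze_pretrained(model)
for epoch in range(10):
    unfreeze_top_layers(model, num_unfrozen=epoch + 1)
    # ... run one fine-tuning epoch here ...
```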
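The abstract only states that GradDrop stochastically masks gradients during the backward pass, so the sketch below illustrates that general idea with per-parameter gradient hooks and an element-wise Bernoulli mask. The masking granularity and the fixed drop probability are assumptions for illustration, not the paper's exact formulation or hyperparameters.

```python
import torch

def attach_graddrop_hooks(model, drop_prob=0.1):
    """Attach backward hooks that stochastically zero out gradient entries.

    Sketch of gradient sparsification as gradient noise: each gradient entry
    is kept with probability 1 - drop_prob and zeroed otherwise.
    """
    def mask_grad(grad):
        mask = (torch.rand_like(grad) > drop_prob).to(grad.dtype)
        return grad * mask

    for param in model.parameters():
        if param.requires_grad:
            param.register_hook(mask_grad)

# Usage: attach the hooks once before fine-tuning; every subsequent
# loss.backward() then produces stochastically masked gradients.
# attach_graddrop_hooks(model, drop_prob=0.1)
```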
Related papers
- AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning [9.51289606759621]
Training and fine-tuning large language models (LLMs) come with challenges related to memory and computational requirements.
Various techniques have been developed to tackle these challenges, such as low-rank adaptation (LoRA).
We introduce a new method inspired by a phenomenon we formally prove: as training progresses, the rank of the estimated gradient gradually decreases.
arXiv Detail & Related papers (2024-10-23T13:53:26Z)
- How to guess a gradient [68.98681202222664]
We show that gradients are more structured than previously thought.
Exploiting this structure can significantly improve gradient-free optimization schemes.
We highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
arXiv Detail & Related papers (2023-12-07T21:40:44Z)
- Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment [0.6091702876917281]
Gradient sparsification has been proposed to reduce communication traffic significantly.
Top-k gradient sparsification (Top-k SGD) has limited ability to speed up overall training performance.
We conduct experiments that show the inefficiency of Top-k SGD and provide insight into its low performance (a minimal sketch of the Top-k selection step appears after this list).
arXiv Detail & Related papers (2022-09-18T07:42:31Z)
- BBTv2: Pure Black-Box Optimization Can Be Comparable to Gradient Descent for Few-Shot Learning [83.26610968655815]
Black-Box Tuning is a derivative-free approach to optimize continuous prompt tokens prepended to the input of language models.
We present BBTv2, a pure black-box optimization approach that can drive language models to achieve comparable results to gradient-based optimization.
arXiv Detail & Related papers (2022-05-23T11:10:19Z)
- Gradient Correction beyond Gradient Descent [63.33439072360198]
Gradient correction is arguably the most crucial aspect of training a neural network.
We introduce a framework (GCGD) to perform gradient correction.
Experiment results show that our gradient correction framework can effectively improve gradient quality, reducing training epochs by roughly 20% while also improving network performance.
arXiv Detail & Related papers (2022-03-16T01:42:25Z)
- On Training Implicit Models [75.20173180996501]
We propose a novel gradient estimate for implicit models, named phantom gradient, that forgoes the costly computation of the exact gradient.
Experiments on large-scale tasks demonstrate that these lightweight phantom gradients significantly accelerate the backward passes in training implicit models by roughly 1.7 times.
arXiv Detail & Related papers (2021-11-09T14:40:24Z)
- Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z)
- Decreasing scaling transition from adaptive gradient descent to stochastic gradient descent [1.7874193862154875]
We propose DSTAda, a decreasing scaling transition from adaptive gradient descent to stochastic gradient descent.
Our experimental results show that DSTAda has faster convergence speed, higher accuracy, and better stability and robustness.
arXiv Detail & Related papers (2021-06-12T11:28:58Z)
- SSGD: A safe and efficient method of gradient descent [0.5099811144731619]
The gradient descent method plays an important role in solving various optimization problems.
The super gradient descent approach updates parameters by concealing the length of the gradient.
Our algorithm can defend against attacks on the gradient.
arXiv Detail & Related papers (2020-12-03T17:09:20Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
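As a companion to the Top-k gradient sparsification entry above (forward-referenced in that item), here is a minimal single-process sketch of the selection step: keep the k largest-magnitude gradient entries and zero the rest. The fraction-based choice of k and the omission of error feedback and gradient communication are simplifying assumptions.

```python
import torch

def topk_sparsify(grad, k_fraction=0.01):
    """Keep only the k largest-magnitude entries of `grad` and zero the rest.

    Distributed Top-k SGD implementations additionally communicate only the
    selected values/indices and usually accumulate the dropped residual
    locally (error feedback); both are omitted in this sketch.
    """
    flat = grad.flatten()
    k = max(1, int(k_fraction * flat.numel()))
    _, idx = torch.topk(flat.abs(), k)
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.view_as(grad)

# Usage after loss.backward() and before optimizer.step():
# for p in model.parameters():
#     if p.grad is not None:
#         p.grad = topk_sparsify(p.grad, k_fraction=0.01)
```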