ScaLA: Accelerating Adaptation of Pre-Trained Transformer-Based Language
Models via Efficient Large-Batch Adversarial Noise
- URL: http://arxiv.org/abs/2201.12469v1
- Date: Sat, 29 Jan 2022 01:47:01 GMT
- Title: ScaLA: Accelerating Adaptation of Pre-Trained Transformer-Based Language
Models via Efficient Large-Batch Adversarial Noise
- Authors: Minjia Zhang, Niranjan Uma Naresh, Yuxiong He
- Abstract summary: Large pretrained Transformer-based language models have led to dramatic improvements in many natural language understanding tasks.
ScaLA is a novel and efficient method to accelerate the adaptation speed of pre-trained transformer networks.
Experiment results show that ScaLA attains 2.7--9.8$\times$ adaptation speedups over the baseline for GLUE on BERT-base and RoBERTa-large.
- Score: 20.779167087445995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, large pre-trained Transformer-based language models have led
to dramatic improvements in many natural language understanding tasks. To train
these models with increasing sizes, many neural network practitioners attempt
to increase the batch sizes in order to leverage multiple GPUs to improve
training speed. However, increasing the batch size often makes the optimization
more difficult, leading to slow convergence or poor generalization that can
require orders of magnitude more training time to achieve the same model
quality. In this paper, we explore the steepness of the loss landscape of
large-batch optimization for adapting pre-trained Transformer-based language
models to domain-specific tasks and find that it tends to be highly complex and
irregular, posing challenges to generalization on downstream tasks.
To tackle this challenge, we propose ScaLA, a novel and efficient method to
accelerate the adaptation speed of pre-trained transformer networks. Different
from prior methods, we take a sequential game-theoretic approach by adding
lightweight adversarial noise into large-batch optimization, which
significantly improves adaptation speed while preserving model generalization.
Experiment results show that ScaLA attains 2.7--9.8$\times$ adaptation speedups
over the baseline for GLUE on BERT-base and RoBERTa-large, while achieving
comparable and sometimes higher accuracy than the state-of-the-art large-batch
optimization methods. Finally, we also address the theoretical aspect of
large-batch optimization with adversarial noise and provide a theoretical
convergence rate analysis for ScaLA using techniques for analyzing non-convex
saddle-point problems.
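Read as a sequential game, the recipe above amounts to an inner player that perturbs the inputs to increase the loss and an outer player that updates the model on both the clean and perturbed batch, roughly $\min_\theta \max_{\|\delta\|\le\epsilon} \mathcal{L}(\theta; x+\delta, y)$. The following is a minimal PyTorch sketch of that general pattern, assuming a toy encoder, a single ascent step on the perturbation, and illustrative hyperparameters; it is an interpretation of the abstract, not the authors' ScaLA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for a pre-trained Transformer encoder plus a classification head.
vocab_size, hidden, num_labels = 1000, 64, 2
embedding = nn.Embedding(vocab_size, hidden)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
    num_layers=2,
)
classifier = nn.Linear(hidden, num_labels)
params = list(embedding.parameters()) + list(encoder.parameters()) + list(classifier.parameters())

# Large-batch fine-tuning would typically use a layer-wise optimizer (e.g. LAMB);
# plain AdamW keeps this sketch dependency-free.
optimizer = torch.optim.AdamW(params, lr=3e-4)

def logits_from_embeds(embeds):
    # Mean-pool the encoder output as a stand-in for a [CLS] head.
    return classifier(encoder(embeds).mean(dim=1))

def training_step(input_ids, labels, adv_eps=1e-3, adv_lr=1e-3):
    embeds = embedding(input_ids)

    # Inner maximization: one cheap ascent step on a small embedding perturbation,
    # approximating max_{||delta|| <= eps} L(theta; x + delta, y).
    delta = torch.zeros_like(embeds).uniform_(-adv_eps, adv_eps).requires_grad_(True)
    adv_loss = F.cross_entropy(logits_from_embeds(embeds.detach() + delta), labels)
    (delta_grad,) = torch.autograd.grad(adv_loss, delta)
    delta = (delta + adv_lr * delta_grad.sign()).clamp(-adv_eps, adv_eps).detach()

    # Outer minimization: update the model on the clean and perturbed views of the batch.
    clean_loss = F.cross_entropy(logits_from_embeds(embeds), labels)
    noisy_loss = F.cross_entropy(logits_from_embeds(embeds + delta), labels)
    loss = clean_loss + noisy_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One illustrative "large" batch of random token ids (batch 256, sequence length 16).
input_ids = torch.randint(0, vocab_size, (256, 16))
labels = torch.randint(0, num_labels, (256,))
print(training_step(input_ids, labels))
```

A practical large-batch setup would typically swap AdamW for a layer-wise optimizer such as LAMB and tune the perturbation budget per task; both are left out here to keep the sketch short.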
Related papers
- AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning [22.950914612765494]
Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks.
Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph (a minimal sketch of this forward-only idea appears after this list).
We propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of ZO methods.
arXiv Detail & Related papers (2024-06-26T04:33:13Z)
- DiJiang: Efficient Large Language Models through Compact Kernelization [30.24187657746638]
We present a novel Frequency Domain Kernelization approach that enables the transformation of a pre-trained vanilla Transformer into a linear-complexity model with little training cost.
Experiments demonstrate that the proposed method achieves comparable performance to the original Transformer, but with significantly reduced training costs and much faster inference speeds.
arXiv Detail & Related papers (2024-03-29T02:32:15Z)
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z)
- Multiplicative update rules for accelerating deep learning training and
increasing robustness [69.90473612073767]
We propose an optimization framework that fits a wide range of machine learning algorithms and enables one to apply alternative update rules.
We claim that the proposed framework accelerates training while leading to more robust models than the traditionally used additive update rule.
arXiv Detail & Related papers (2023-07-14T06:44:43Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- Accelerating Training of Transformer-Based Language Models with
Progressive Layer Dropping [24.547833264405355]
The proposed method achieves a 24% time reduction on average per sample and allows the pre-training to be 2.5 times faster than the baseline.
While being faster, our pre-trained models are equipped with strong knowledge transferability, achieving comparable and sometimes higher GLUE score than the baseline.
arXiv Detail & Related papers (2020-10-26T06:50:07Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on
Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
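For contrast with the gradient-based sketch above, the forward-only (zeroth-order) idea referenced in the AdaZeta entry can be illustrated with a generic two-point estimator: perturb the weights along a random probe, measure the loss with two forward passes, and step against the estimated directional derivative. Everything below (the toy model, probe scheme, and step sizes) is an assumption for illustration; it is neither the MeZO implementation nor AdaZeta's tensor-train adaptation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

model = nn.Linear(32, 2)                 # toy stand-in for the weights being tuned
x = torch.randn(64, 32)                  # random features standing in for a batch
y = torch.randint(0, 2, (64,))

def loss_fn():
    with torch.no_grad():                # forward pass only; no autograd graph is kept
        return F.cross_entropy(model(x), y).item()

def perturb(scale, probes):
    # Shift every parameter tensor along its probe direction, in place.
    with torch.no_grad():
        for p, z in zip(model.parameters(), probes):
            p.add_(scale * z)

def zeroth_order_step(eps=1e-3, lr=1e-2):
    # One Gaussian probe direction per parameter tensor.
    probes = [torch.randn_like(p) for p in model.parameters()]

    perturb(+eps, probes)
    loss_plus = loss_fn()                # L(theta + eps * z)
    perturb(-2 * eps, probes)
    loss_minus = loss_fn()               # L(theta - eps * z)
    perturb(+eps, probes)                # restore theta

    # Two-point directional-derivative estimate, then an SGD-style step along the probe.
    g = (loss_plus - loss_minus) / (2 * eps)
    perturb(-lr * g, probes)

for _ in range(5):
    zeroth_order_step()
print("loss after 5 forward-only steps:", loss_fn())
```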