Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of
Language Model
- URL: http://arxiv.org/abs/2305.15265v2
- Date: Sat, 9 Dec 2023 17:32:13 GMT
- Title: Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of
Language Model
- Authors: Zirui Liu, Guanchu Wang, Shaochen Zhong, Zhaozhuo Xu, Daochen Zha,
Ruixiang Tang, Zhimeng Jiang, Kaixiong Zhou, Vipin Chaudhary, Shuai Xu, Xia
Hu
- Abstract summary: We propose WTA-CRS, a new family of unbiased estimators for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
- Score: 92.55145016562867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid growth in model size, fine-tuning large pre-trained
language models has become increasingly difficult due to their extensive memory
usage. Previous works usually focus on reducing the number of trainable
parameters in the network. While the model parameters do contribute to memory
usage, the primary memory bottleneck during training arises from storing
feature maps, also known as activations, as they are crucial for gradient
calculation. Notably, neural networks are usually trained using stochastic
gradient descent. We argue that in stochastic optimization, models can handle
noisy gradients as long as the gradient estimator is unbiased with reasonable
variance. Following this motivation, we propose a new family of unbiased
estimators called WTA-CRS for matrix multiplication with reduced variance, which
only requires storing the sub-sampled activations for calculating the gradient.
Our work provides both theoretical and experimental evidence that, in the
context of tuning transformers, our proposed estimators exhibit lower variance
compared to existing ones. By replacing the linear operation with our
approximated one in transformers, we can achieve up to a $2.7\times$ peak memory
reduction with almost no accuracy drop and enable up to $6.4\times$ larger
batch sizes. On the same hardware, WTA-CRS enables better downstream task
performance by applying larger models and/or faster training with larger
batch sizes.
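The abstract's central idea, an unbiased sampled estimate of a matrix product that needs only a subset of column-row pairs, can be illustrated with a small sketch. The code below is not the authors' implementation: the function names (`pair_probs`, `add_outer`, `crs_matmul`, `wta_crs_matmul`), the norm-based sampling probabilities, and the use of plain Python lists are all assumptions made for illustration. `crs_matmul` is ordinary column-row sampling; `wta_crs_matmul` is a variant in the winner-take-all spirit that keeps the highest-probability pairs exactly and samples only from the remainder.

```python
import random

def pair_probs(A, B):
    """Assumed importance weights: p_i proportional to ||A[:, i]|| * ||B[i, :]||.
    Assumes at least one column-row pair is nonzero."""
    w = [sum(row[i] ** 2 for row in A) ** 0.5 * sum(x * x for x in B[i]) ** 0.5
         for i in range(len(B))]
    total = sum(w)
    return [x / total for x in w]

def add_outer(out, A, B, i, scale):
    """In place: out += scale * outer(A[:, i], B[i, :])."""
    for r, row in enumerate(A):
        for c, b in enumerate(B[i]):
            out[r][c] += scale * row[i] * b

def crs_matmul(A, B, k, rng):
    """Column-row sampling: average k sampled outer products, each divided
    by its sampling probability so that the expectation equals A @ B."""
    p = pair_probs(A, B)
    out = [[0.0] * len(B[0]) for _ in A]
    for i in rng.choices(range(len(B)), weights=p, k=k):
        add_outer(out, A, B, i, 1.0 / (k * p[i]))
    return out

def wta_crs_matmul(A, B, k, n_det, rng):
    """Hypothetical winner-take-all variant: the n_det most probable pairs are
    added exactly; the other k - n_det slots are sampled from the remaining
    pairs. Requires n_det < k unless every pair is kept."""
    p = pair_probs(A, B)
    order = sorted(range(len(B)), key=lambda i: p[i], reverse=True)
    kept, rest = order[:n_det], order[n_det:]
    out = [[0.0] * len(B[0]) for _ in A]
    for i in kept:
        add_outer(out, A, B, i, 1.0)  # deterministic part, no rescaling
    n_samp = k - n_det
    rest_mass = sum(p[i] for i in rest)  # probability mass left to sample
    if rest and n_samp > 0 and rest_mass > 0:
        # rng.choices normalizes the weights, so j is drawn with
        # probability p[j] / rest_mass; the scale below undoes that bias.
        for j in rng.choices(rest, weights=[p[i] for i in rest], k=n_samp):
            add_outer(out, A, B, j, rest_mass / (n_samp * p[j]))
    return out
```

Calling `wta_crs_matmul(A, B, k, n_det, random.Random(0))` on small nested lists returns an estimate of the product that touches only `k` column-row pairs. In a training setting, this is what would allow only the sub-sampled activations to be stored for the gradient computation, as the abstract describes.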
Related papers
- Thinking Forward: Memory-Efficient Federated Finetuning of Language Models [21.438831528354513]
This work introduces Spry, an FL algorithm that splits trainable weights of an LLM among participating clients.
Spry achieves a low memory footprint, high accuracy, and fast convergence.
Spry makes feasible previously impossible FL deployments on commodity mobile and edge devices.
arXiv Detail & Related papers (2024-05-24T13:37:48Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- SPT: Fine-Tuning Transformer-based Language Models Efficiently with
Sparsification [14.559316921646356]
Fine-tuning Transformer-based models for downstream tasks has long running time and high memory consumption.
We propose the SPT system to fine-tune Transformer-based models efficiently by introducing sparsity.
SPT consistently outperforms well-optimized baselines, reducing the peak memory consumption by up to 50% and accelerating fine-tuning by up to 2.2x.
arXiv Detail & Related papers (2023-12-16T07:44:52Z)
- Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized
Language Model Finetuning Using Shared Randomness [86.61582747039053]
Language model training in distributed settings is limited by the communication cost of exchanges.
We extend recent work using shared randomness to perform distributed fine-tuning with low bandwidth.
arXiv Detail & Related papers (2023-06-16T17:59:51Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during forward pass while storing a low-precision version of activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- Scatterbrain: Unifying Sparse and Low-rank Attention Approximation [25.375024028636663]
We propose Scatterbrain, a novel way to unify sparse (via locality sensitive hashing) and low-rank (via kernel feature map) attention for accurate approximation.
We empirically show that Scatterbrain can achieve 2.1x lower error than baselines when serving as a drop-in replacement in BigGAN image generation and pre-trained T2T-ViT.
We demonstrate Scatterbrain for end-to-end training with up to 4 points better perplexity and 5 points better average accuracy than sparse or low-rank efficient transformers on language modeling and long-range-arena tasks.
arXiv Detail & Related papers (2021-10-28T17:52:17Z)
- Comparing Classes of Estimators: When does Gradient Descent Beat Ridge
Regression in Linear Models? [46.01087792062936]
We compare classes of estimators via the relative performance of the best method in the class.
This allows us to rigorously quantify the tuning sensitivity of learning algorithms.
arXiv Detail & Related papers (2021-08-26T16:01:37Z)
- Large Scale Private Learning via Low-rank Reparametrization [77.38947817228656]
We propose a reparametrization scheme to address the challenges of applying differentially private SGD on large neural networks.
We are the first to apply differential privacy to the BERT model, achieving an average accuracy of 83.9% on four downstream tasks.
arXiv Detail & Related papers (2021-06-17T10:14:43Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.