SPT: Fine-Tuning Transformer-based Language Models Efficiently with
Sparsification
- URL: http://arxiv.org/abs/2312.10365v1
- Date: Sat, 16 Dec 2023 07:44:52 GMT
- Title: SPT: Fine-Tuning Transformer-based Language Models Efficiently with
Sparsification
- Authors: Yuntao Gui, Xiao Yan, Peiqi Yin, Han Yang, James Cheng
- Abstract summary: Fine-tuning Transformer-based models for downstream tasks has a long running time and high memory consumption.
We propose the SPT system to fine-tune Transformer-based models efficiently by introducing sparsity.
SPT consistently outperforms well-optimized baselines, reducing the peak memory consumption by up to 50% and accelerating fine-tuning by up to 2.2x.
- Score: 14.559316921646356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based large language models (e.g., BERT and GPT) achieve great
success, and fine-tuning, which tunes a pre-trained model on a task-specific
dataset, is the standard practice to utilize these models for downstream tasks.
However, Transformer fine-tuning has a long running time and high memory
consumption due to the large size of the models. We propose the SPT system to
fine-tune Transformer-based models efficiently by introducing sparsity. We
observe that the memory consumption of Transformer mainly comes from storing
attention weights for multi-head attention (MHA), and the majority of running
time is spent on feed-forward network (FFN). Thus, we design the sparse MHA
module, which computes and stores only large attention weights to reduce memory
consumption, and the routed FFN module, which dynamically activates a subset of
model parameters for each token to reduce computation cost. We implement SPT on
PyTorch and customize CUDA kernels to run sparse MHA and routed FFN
efficiently. Specifically, we use product quantization to identify the large
attention weights and compute attention via sparse matrix multiplication for
sparse MHA. For routed FFN, we batch the tokens according to their activated
model parameters for efficient computation. We conduct extensive experiments to
evaluate SPT on various model configurations. The results show that SPT
consistently outperforms well-optimized baselines, reducing the peak memory
consumption by up to 50% and accelerating fine-tuning by up to 2.2x.
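The two ideas in the abstract can be illustrated with a minimal pure-Python sketch. This is not the SPT implementation: the function names `topk_sparse_attention` and `group_tokens_by_block` are hypothetical, the top-k selection is done exactly over all scores (the abstract says SPT uses product quantization to find the large attention weights, which is not reproduced here), and the routing decisions are assumed to be given as a list of block ids rather than computed by a router.

```python
import math
from collections import defaultdict


def topk_sparse_attention(query, keys, values, k):
    """Attend from one query to only the k highest-scoring keys.

    Assumption: exact scoring of all keys, then top-k selection. SPT is
    described as using product quantization for this selection step instead.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    # Indices of the k largest scores; only these weights are kept and stored.
    kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the kept scores, shifted by the max for stability.
    top = max(scores[i] for i in kept)
    exps = {i: math.exp(scores[i] - top) for i in kept}
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}
    dim = len(values[0])
    output = [sum(w * values[i][d] for i, w in weights.items()) for d in range(dim)]
    return output, weights


def group_tokens_by_block(block_ids):
    """Map each FFN block id to the token positions routed to it.

    Assumption: block_ids[t] is the (hypothetical) block chosen for token t.
    Grouping lets each block process its tokens as one batch.
    """
    groups = defaultdict(list)
    for position, block in enumerate(block_ids):
        groups[block].append(position)
    return dict(groups)


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out, w = topk_sparse_attention([1.0, 0.0], keys, values, k=2)
batches = group_tokens_by_block([2, 0, 2, 1])
```

With k smaller than the sequence length, each query stores k weights instead of one per key, which is the source of the memory saving the abstract describes. The grouping step only reorders work, so a dense per-block computation can then run on each batch.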
Related papers
- Expanding Sparse Tuning for Low Memory Usage [103.43560327427647]
We propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage.
To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices.
A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes.
arXiv Detail & Related papers (2024-11-04T04:58:20Z)
- XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection [30.687511115573038]
XMoE is a novel MoE design that aims to enhance both the efficacy and efficiency of sparse MoE models.
XMoE can enhance model performance while decreasing the computation load at MoE layers by over 50%.
arXiv Detail & Related papers (2024-02-27T08:18:02Z)
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency
Trade-off in Language Model Inference [57.119047493787185]
In practice, our method can reduce model size by 43.1% and bring $1.25\sim1.56\times$ wall clock time speedup on different hardware with negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
- MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
- READ: Recurrent Adaptation of Large Transformers [7.982905666062059]
Fine-tuning large-scale Transformers becomes impractical as the model size and number of tasks increase.
We introduce REcurrent ADaption (READ), a lightweight and memory-efficient fine-tuning method.
arXiv Detail & Related papers (2023-05-24T16:59:41Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Fourier Transformer: Fast Long Range Modeling by Removing Sequence
Redundancy with FFT Operator [24.690247474891958]
Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performances among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART.
arXiv Detail & Related papers (2023-05-24T12:33:06Z)
- Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate PST performs on par or better than previous sparsity methods.
arXiv Detail & Related papers (2022-05-23T02:43:45Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during forward pass while storing a low-precision version of activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)