Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized
Large Language Models
- URL: http://arxiv.org/abs/2401.07159v1
- Date: Sat, 13 Jan 2024 21:00:21 GMT
- Title: Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized
Large Language Models
- Authors: Zhengxin Zhang, Dan Zhao, Xupeng Miao, Gabriele Oliaro, Qing Li, Yong
Jiang, and Zhihao Jia
- Abstract summary: Finetuning large language models (LLMs) has been empirically effective on a variety of downstream tasks.
Existing approaches to finetuning an LLM either focus on parameter-efficient finetuning, or attempt to reduce the memory footprint during the training phase.
We present Quantized Side Tuning (QST), which enables memory-efficient and fast finetuning of LLMs through a dual-stage process.
- Score: 37.516453975389624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Finetuning large language models (LLMs) has been empirically effective on a
variety of downstream tasks. Existing approaches to finetuning an LLM either
focus on parameter-efficient finetuning, which only updates a small number of
trainable parameters, or attempt to reduce the memory footprint during the
training phase of the finetuning. Typically, the memory footprint during
finetuning stems from three contributors: model weights, optimizer states, and
intermediate activations. However, existing works still require considerable
memory and none can simultaneously mitigate memory footprint for all three
sources. In this paper, we present Quantized Side Tuning (QST), which enables
memory-efficient and fast finetuning of LLMs by operating through a dual-stage
process. First, QST quantizes an LLM's model weights into 4-bit to reduce the
memory footprint of the LLM's original weights; QST also introduces a side
network separated from the LLM, which utilizes the hidden states of the LLM to
make task-specific predictions. Using a separate side network avoids performing
backpropagation through the LLM, thus reducing the memory requirement of the
intermediate activations. Furthermore, QST leverages several low-rank adaptors
and gradient-free downsample modules to significantly reduce the trainable
parameters, so as to save the memory footprint of the optimizer states.
Experiments show that QST can reduce the total memory footprint by up to 2.3
$\times$ and speed up the finetuning process by up to 3 $\times$ while
achieving performance competitive with the state-of-the-art. Compared with
full finetuning, QST can reduce the total memory footprint by up to 7
$\times$.
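As a rough illustration (not the authors' implementation), the two ideas in the abstract, absmax-style 4-bit quantization of the frozen backbone weights and a narrow side network that consumes the backbone's hidden states through a gradient-free downsample, can be sketched in NumPy. All function names, dimensions, and the specific quantizer below are illustrative assumptions; QST's actual quantization scheme and side-network architecture may differ:

```python
import numpy as np

def quantize_4bit(w, group_size=64):
    """Absmax-quantize weights to signed 4-bit integers per group.
    Stores one fp32 scale per group plus int4-range codes."""
    flat = w.reshape(-1, group_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0   # int4 range [-7, 7]
    q = np.clip(np.round(flat / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    """Recover approximate fp32 weights for the forward pass only."""
    return (q.astype(np.float32) * scale).reshape(shape)

def downsample(h, d_out):
    """Gradient-free downsample: average-pool the hidden state so the
    side network operates at a much smaller width (no trainable params)."""
    return h.reshape(d_out, -1).mean(axis=1)

rng = np.random.default_rng(0)
d, d_side = 64, 16                       # backbone vs. side-network width
W = rng.normal(size=(d, d)).astype(np.float32)   # frozen backbone layer

q, s = quantize_4bit(W)
W_hat = dequantize_4bit(q, s, W.shape)   # never updated; no optimizer state

h = rng.normal(size=(d,)).astype(np.float32)
h_backbone = np.tanh(W_hat @ h)          # frozen, quantized backbone step
z = downsample(h_backbone, d_side)       # feeds the side network

W_side = rng.normal(size=(d_side, d_side)).astype(np.float32) * 0.01
y = np.tanh(W_side @ z)                  # only W_side would receive gradients
```

Because only `W_side` is trainable and the backbone is never backpropagated through, optimizer state and intermediate-activation memory both shrink, which is the mechanism the abstract describes.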
Related papers
- MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
arXiv Detail & Related papers (2024-07-22T01:52:30Z)
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At a batch size of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z)
- Low-Rank Quantization-Aware Training for LLMs [8.535254310145005]
Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands.
We propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs.
Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage.
arXiv Detail & Related papers (2024-06-10T15:44:22Z)
- Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity [66.67596152389591]
Zeroth-order optimization (ZO) is a memory-efficient strategy for fine-tuning Large Language Models.
In this study, we investigate the feasibility of fine-tuning an extremely small subset of LLM parameters using ZO.
Our results demonstrate that fine-tuning just 0.1% of sensitive parameters in the LLM with ZO can outperform full ZO fine-tuning.
arXiv Detail & Related papers (2024-06-05T04:07:35Z)
- QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources [37.265708531464746]
Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks.
Fine-tuning these pre-trained models on downstream datasets provides further significant performance gains, but this process has been challenging due to its extraordinary resource requirements.
We propose QFT, a novel Quantized Full-parameter Tuning framework for LLMs that enables memory-efficient fine-tuning without harming performance.
arXiv Detail & Related papers (2023-10-11T02:47:40Z)
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without any performance degradation, at a 15% speedup.
arXiv Detail & Related papers (2023-06-05T17:53:28Z)
- Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization [27.79783067245817]
Large language models (LLMs) face the challenges in fine-tuning and deployment due to their high memory demands and computational costs.
This paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA) - a simple yet effective method that combines the advantages of PEFT with quantized LLMs.
arXiv Detail & Related papers (2023-05-23T15:20:01Z)
- LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning [82.93130407930762]
It is costly to update the entire parameter set of large pre-trained models.
PETL techniques allow updating a small subset of parameters inside a pre-trained backbone network for a new task.
We propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements substantially more than prior PETL methods.
arXiv Detail & Related papers (2022-06-13T23:51:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.