Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized
Large Language Models
- URL: http://arxiv.org/abs/2401.07159v1
- Date: Sat, 13 Jan 2024 21:00:21 GMT
- Title: Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized
Large Language Models
- Authors: Zhengxin Zhang, Dan Zhao, Xupeng Miao, Gabriele Oliaro, Qing Li, Yong
Jiang, and Zhihao Jia
- Abstract summary: Finetuning large language models (LLMs) has been empirically effective on a variety of downstream tasks.
Existing approaches to finetuning an LLM either focus on parameter-efficient finetuning, or attempt to reduce the memory footprint during the training phase.
We present Quantized Side Tuning (QST), which enables memory-efficient and fast finetuning of LLMs through a dual-stage process.
- Score: 37.516453975389624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Finetuning large language models (LLMs) has been empirically effective on a
variety of downstream tasks. Existing approaches to finetuning an LLM either
focus on parameter-efficient finetuning, which only updates a small number of
trainable parameters, or attempt to reduce the memory footprint during the
training phase of the finetuning. Typically, the memory footprint during
finetuning stems from three contributors: model weights, optimizer states, and
intermediate activations. However, existing works still require considerable
memory and none can simultaneously mitigate memory footprint for all three
sources. In this paper, we present Quantized Side Tuning (QST), which enables
memory-efficient and fast finetuning of LLMs through a dual-stage
process. First, QST quantizes the LLM's weights to 4-bit precision to reduce the
memory footprint of the LLM's original weights; QST also introduces a side
network separated from the LLM, which utilizes the hidden states of the LLM to
make task-specific predictions. Using a separate side network avoids performing
backpropagation through the LLM, thus reducing the memory requirement of the
intermediate activations. Furthermore, QST leverages several low-rank adaptors
and gradient-free downsample modules to significantly reduce the trainable
parameters, so as to save the memory footprint of the optimizer states.
Experiments show that QST can reduce the total memory footprint by up to 2.3
$\times$ and speed up the finetuning process by up to 3 $\times$ while
achieving performance comparable to the state of the art. Compared with full
finetuning, QST can reduce the total memory footprint by up to 7 $\times$.
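To make the dual-stage process concrete, the following is a minimal PyTorch sketch of the idea; it is not the authors' implementation. The 4-bit quantized LLM is simulated by a frozen toy backbone, and names such as `FrozenBlock`, `SideBlock`, the pooling-based downsample, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of the QST idea (illustrative; not the paper's code).
# The 4-bit quantized LLM is simulated by a frozen backbone; only the small
# side network and its low-rank layers receive gradients and optimizer states.
import torch
import torch.nn as nn


class FrozenBlock(nn.Module):
    """Stand-in for one layer of the quantized LLM; its weights are frozen."""
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Linear(d, d)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, x):
        return torch.relu(self.ff(x))


class SideBlock(nn.Module):
    """Low-rank side layer that mixes the downsampled hidden state into the side state."""
    def __init__(self, d_small, rank=8):
        super().__init__()
        self.down = nn.Linear(d_small, rank, bias=False)
        self.up = nn.Linear(rank, d_small, bias=False)

    def forward(self, h_small, s):
        return s + self.up(self.down(h_small + s))


class QSTSketch(nn.Module):
    def __init__(self, d=512, d_small=64, n_layers=4, n_classes=2):
        super().__init__()
        self.d_small = d_small
        self.backbone = nn.ModuleList(FrozenBlock(d) for _ in range(n_layers))
        self.side = nn.ModuleList(SideBlock(d_small) for _ in range(n_layers))
        self.head = nn.Linear(d_small, n_classes)

    def downsample(self, h):
        # Parameter-free ("gradient-free") downsample: average-pool groups of features.
        return h.view(h.size(0), self.d_small, -1).mean(dim=-1)

    def forward(self, x):
        s = torch.zeros(x.size(0), self.d_small, device=x.device)
        h = x
        for blk, side in zip(self.backbone, self.side):
            h = blk(h)
            # detach(): hidden states feed the side network, but no gradients
            # flow back through the LLM, so its activations need not be stored.
            s = side(self.downsample(h.detach()), s)
        return self.head(s)


model = QSTSketch()
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-3)  # optimizer states only for the side network

logits = model(torch.randn(4, 512))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 2, (4,)))
loss.backward()
opt.step()
```

In the actual method the backbone weights would be stored in 4-bit form rather than merely frozen; the sketch only illustrates where the three memory savings come from: quantized weights, optimizer states limited to the small side network, and no activation storage for backpropagation through the LLM.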
Related papers
- Quantizing Large Language Models for Code Generation: A Differentiated Replication [51.85505914274633]
Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, in automatically implementing requirements described in natural language.
LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint.
The new frontier for LLM quantization is 4-bit precision, which yields an average memory footprint reduction of 70%.
arXiv Detail & Related papers (2025-03-10T09:26:08Z) - Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models.
High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size.
We propose sparse gradient compression (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z) - Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At a batch size of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z) - Low-Rank Quantization-Aware Training for LLMs [8.535254310145005]
Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands.
We propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs.
Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage.
arXiv Detail & Related papers (2024-06-10T15:44:22Z) - Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity [66.67596152389591]
Zeroth-order optimization (ZO) is a memory-efficient strategy for fine-tuning Large Language Models.
In this study, we investigate the feasibility of fine-tuning an extremely small subset of LLM parameters using ZO.
Our results demonstrate that fine-tuning just 0.1% of sensitive parameters in the LLM with ZO can outperform full ZO fine-tuning (see the zeroth-order sketch after this list).
arXiv Detail & Related papers (2024-06-05T04:07:35Z) - QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources [37.265708531464746]
Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks.
Fine-tuning these pre-trained models on downstream datasets provides further significant performance gains, but this process has been challenging due to its extraordinary resource requirements.
We propose QFT, a novel Quantized Full-parameter Tuning framework for LLMs that enables memory-efficient fine-tuning without harming performance.
arXiv Detail & Related papers (2023-10-11T02:47:40Z) - ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with
Modular Quantizers [38.16040503271727]
We propose a memory-efficient finetuning algorithm for large language models (LLMs).
ModuLoRA attains competitive performance on text classification, natural language inference, and instruction-following tasks using significantly less memory than existing approaches.
We also surpass the state-of-the-art ROUGE score on a popular summarization task.
arXiv Detail & Related papers (2023-09-28T02:55:01Z) - SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight
Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without performance degradation, with a 15% speedup.
arXiv Detail & Related papers (2023-06-05T17:53:28Z) - Memory-Efficient Fine-Tuning of Compressed Large Language Models via
sub-4-bit Integer Quantization [27.79783067245817]
Large language models (LLMs) face the challenges in fine-tuning and deployment due to their high memory demands and computational costs.
This paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA), a simple yet effective method that combines the advantages of PEFT with quantized LLMs.
arXiv Detail & Related papers (2023-05-23T15:20:01Z) - LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer
Learning [82.93130407930762]
It is costly to update the entire parameter set of large pre-trained models.
PETL techniques allow updating a small subset of parameters inside a pre-trained backbone network for a new task.
We propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements substantially more than existing PETL methods.
arXiv Detail & Related papers (2022-06-13T23:51:56Z) - LiST: Lite Self-training Makes Efficient Few-shot Learners [91.28065455714018]
LiST improves over classic fine-tuning methods by 35% and over prompt-tuning by 6%, with a 96% reduction in the number of trainable parameters, when fine-tuned with no more than 30 labeled examples from each target domain.
arXiv Detail & Related papers (2021-10-12T18:47:18Z)
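For the zeroth-order entry above, here is a minimal sketch of a two-point (SPSA-style) zeroth-order gradient estimate of the kind such methods rely on; the toy model and the choice of "sensitive" parameters (here, just the last layer's bias) are illustrative assumptions, not the paper's selection rule.

```python
# Minimal two-point zeroth-order (SPSA-style) update on a tiny parameter subset
# (illustrative assumptions only; not the paper's parameter-selection method).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))

# Pretend only a tiny "sensitive" subset is tuned: here, the last layer's bias.
params = [model[2].bias]
eps, lr = 1e-3, 1e-2

def loss_fn():
    with torch.no_grad():  # forward passes only; nothing is stored for backprop
        return nn.functional.cross_entropy(model(x), y)

# One ZO step: probe the loss at theta + eps*z and theta - eps*z, then move along z.
z = [torch.randn_like(p) for p in params]
for p, zi in zip(params, z):
    p.data.add_(eps * zi)
loss_plus = loss_fn()
for p, zi in zip(params, z):
    p.data.sub_(2 * eps * zi)
loss_minus = loss_fn()
for p, zi in zip(params, z):
    p.data.add_(eps * zi)                      # restore theta
    grad_est = (loss_plus - loss_minus) / (2 * eps) * zi
    p.data.sub_(lr * grad_est)                 # SGD-style update with the estimated gradient
```

Both probes are plain forward passes under `torch.no_grad()`, so no activations are kept for backpropagation, which is what makes zeroth-order fine-tuning memory-efficient.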