QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
- URL: http://arxiv.org/abs/2310.07147v1
- Date: Wed, 11 Oct 2023 02:47:40 GMT
- Title: QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
- Authors: Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Qingyi Gu, Kurt
Keutzer
- Abstract summary: Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks.
Fine-tuning these pre-trained models on downstream datasets provides further significant performance gains, but this process has been challenging due to its extraordinary resource requirements.
We propose QFT, a novel Quantized Full- parameter Tuning framework for LLMs that enables memory-efficient fine-tuning without harming performance.
- Score: 37.265708531464746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have showcased remarkable impacts across a wide
spectrum of natural language processing tasks. Fine-tuning these pre-trained
models on downstream datasets provides further significant performance gains,
but this process has been challenging due to its extraordinary resource
requirements. To this end, existing efforts focus on parameter-efficient
fine-tuning, which, unfortunately, fail to capitalize on the powerful potential
of full-parameter fine-tuning. In this work, we propose QFT, a novel Quantized
Full-parameter Tuning framework for LLMs that enables memory-efficient
fine-tuning without harming performance. Our framework incorporates two novel
ideas: (i) we adopt the efficient Lion optimizer, which only keeps track of the
momentum and has consistent update magnitudes for each parameter, an inherent
advantage for robust quantization; and (ii) we quantize all model states and
store them as integer values, and present a gradient flow and parameter update
scheme for the quantized weights. As a result, QFT reduces the model state
memory to 21% of the standard solution while achieving comparable performance,
e.g., tuning a LLaMA-7B model requires only <30GB of memory, satisfied by a
single A6000 GPU.
Related papers
- Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance [20.659750151408186]
Large Language Models (LLMs) have demonstrated impressive performance across various domains.
Existing solutions combine parameter quantization with Low-Rank Adaptation (LoRA)
We propose Quantized LLMs with Balanced-rank Adaptation (Q-BaRA) and Quantization-Aware Fine-tuning with Higher Rank Adaptation (QA-HiRA)
arXiv Detail & Related papers (2024-07-24T06:16:37Z) - EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [62.904403513409484]
Large language models (LLMs) are integral to modern natural language processing and artificial intelligence.
We propose Efficient Quantization-Aware Training (EfficientQAT), a novel quantization technique for compressing LLMs.
Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models.
arXiv Detail & Related papers (2024-07-10T17:53:30Z) - Low-Rank Quantization-Aware Training for LLMs [8.535254310145005]
Large language models (LLMs) are omnipresent, however their practical deployment is challenging due to their ever increasing computational and memory demands.
We propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs.
Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at the fraction of its memory usage.
arXiv Detail & Related papers (2024-06-10T15:44:22Z) - Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity [66.67596152389591]
Zeroth-order optimization (ZO) is a memory-efficient strategy for fine-tuning Large Language Models.
In this study, we investigate the feasibility of fine-tuning an extremely small subset of LLM parameters using ZO.
Our results demonstrate that fine-tuning 0.1% sensitive parameters in the LLM with ZO can outperform the full ZO fine-tuning performance.
arXiv Detail & Related papers (2024-06-05T04:07:35Z) - EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models [21.17675493267517]
Post-training quantization (PTQ) and quantization-aware training (QAT) are two main approaches to compress and accelerate diffusion models.
We introduce a data-free and parameter-efficient fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, to achieve QAT-level performance with PTQ-like efficiency.
Our method significantly outperforms previous PTQ-based diffusion models while maintaining similar time and data efficiency.
arXiv Detail & Related papers (2023-10-05T02:51:53Z) - OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLM.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z) - FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only
Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z) - Memory-Efficient Fine-Tuning of Compressed Large Language Models via
sub-4-bit Integer Quantization [27.79783067245817]
Large language models (LLMs) face the challenges in fine-tuning and deployment due to their high memory demands and computational costs.
This paper presents Efficient Adaptation and Quantization-aware (PEQA) - a simple yet effective method that combines the advantages of PEFT with quantized LLMs.
arXiv Detail & Related papers (2023-05-23T15:20:01Z) - Scaling & Shifting Your Features: A New Baseline for Efficient Model
Tuning [126.84770886628833]
Existing finetuning methods either tune all parameters of the pretrained model (full finetuning) or only tune the last linear layer (linear probing)
We propose a new parameter-efficient finetuning method termed as SSF, representing that researchers only need to Scale and Shift the deep Features extracted by a pre-trained model to catch up with the performance full finetuning.
arXiv Detail & Related papers (2022-10-17T08:14:49Z) - Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than
In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
parameter-efficient fine-tuning offers an alternative paradigm where a small set of parameters are trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.