QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
- URL: http://arxiv.org/abs/2310.07147v1
- Date: Wed, 11 Oct 2023 02:47:40 GMT
- Title: QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
- Authors: Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Qingyi Gu, Kurt
Keutzer
- Abstract summary: Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks.
Fine-tuning these pre-trained models on downstream datasets provides further significant performance gains, but this process has been challenging due to its extraordinary resource requirements.
We propose QFT, a novel Quantized Full- parameter Tuning framework for LLMs that enables memory-efficient fine-tuning without harming performance.
- Score: 37.265708531464746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have showcased remarkable impacts across a wide
spectrum of natural language processing tasks. Fine-tuning these pre-trained
models on downstream datasets provides further significant performance gains,
but this process has been challenging due to its extraordinary resource
requirements. To this end, existing efforts focus on parameter-efficient
fine-tuning, which, unfortunately, fail to capitalize on the powerful potential
of full-parameter fine-tuning. In this work, we propose QFT, a novel Quantized
Full-parameter Tuning framework for LLMs that enables memory-efficient
fine-tuning without harming performance. Our framework incorporates two novel
ideas: (i) we adopt the efficient Lion optimizer, which only keeps track of the
momentum and has consistent update magnitudes for each parameter, an inherent
advantage for robust quantization; and (ii) we quantize all model states and
store them as integer values, and present a gradient flow and parameter update
scheme for the quantized weights. As a result, QFT reduces the model state
memory to 21% of the standard solution while achieving comparable performance,
e.g., tuning a LLaMA-7B model requires only <30GB of memory, satisfied by a
single A6000 GPU.
Related papers
- QEFT: Quantization for Efficient Fine-Tuning of LLMs [9.446971590056945]
We propose a new technique called Quantization for Efficient Fine-Tuning (QEFT)
QEFT accelerates both inference and fine-tuning, is supported by robust theoretical foundations, and maintains good hardware compatibility.
Our experiments demonstrate that QEFT matches the quality and versatility of full-precision parameter-efficient fine-tuning.
arXiv Detail & Related papers (2024-10-11T09:39:33Z) - LoRTA: Low Rank Tensor Adaptation of Large Language Models [70.32218116940393]
Low Rank Adaptation (LoRA) is a popular Efficient Fine Tuning (PEFT) method that effectively adapts large pre-trained models for downstream tasks.
We propose a novel approach that employs a low rank tensor parametrization for model updates.
Our method is both efficient and effective for fine-tuning large language models, achieving a substantial reduction in the number of parameters while maintaining comparable performance.
arXiv Detail & Related papers (2024-10-05T06:59:50Z) - Propulsion: Steering LLM with Tiny Fine-Tuning [0.0]
We propose Propulsion, a novel parameter efficient fine-tuning (PEFT) method to optimize task-specific performance.
Inspired by the concept of controlled adjustments in physical motion, Propulsion selectively re-scales specific dimensions of a pre-trained model.
Our theoretical analysis, supported by Neural Tangent Kernel (NTK) theory, shows that Propulsion approximates the performance of full fine-tuning with far fewer trainable parameters.
arXiv Detail & Related papers (2024-09-17T06:51:59Z) - Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance [20.659750151408186]
Large Language Models (LLMs) have demonstrated impressive performance across various domains.
Existing solutions combine parameter quantization with Low-Rank Adaptation (LoRA)
We propose Quantized LLMs with Balanced-rank Adaptation (Q-BaRA) and Quantization-Aware Fine-tuning with Higher Rank Adaptation (QA-HiRA)
arXiv Detail & Related papers (2024-07-24T06:16:37Z) - EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss.
We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm.
EfficientQAT involves two consecutive phases: Block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP)
arXiv Detail & Related papers (2024-07-10T17:53:30Z) - Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity [66.67596152389591]
Zeroth-order optimization (ZO) is a memory-efficient strategy for fine-tuning Large Language Models.
In this study, we investigate the feasibility of fine-tuning an extremely small subset of LLM parameters using ZO.
Our results demonstrate that fine-tuning 0.1% sensitive parameters in the LLM with ZO can outperform the full ZO fine-tuning performance.
arXiv Detail & Related papers (2024-06-05T04:07:35Z) - OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLM.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z) - FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only
Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z) - Memory-Efficient Fine-Tuning of Compressed Large Language Models via
sub-4-bit Integer Quantization [27.79783067245817]
Large language models (LLMs) face the challenges in fine-tuning and deployment due to their high memory demands and computational costs.
This paper presents Efficient Adaptation and Quantization-aware (PEQA) - a simple yet effective method that combines the advantages of PEFT with quantized LLMs.
arXiv Detail & Related papers (2023-05-23T15:20:01Z) - Scaling & Shifting Your Features: A New Baseline for Efficient Model
Tuning [126.84770886628833]
Existing finetuning methods either tune all parameters of the pretrained model (full finetuning) or only tune the last linear layer (linear probing)
We propose a new parameter-efficient finetuning method termed as SSF, representing that researchers only need to Scale and Shift the deep Features extracted by a pre-trained model to catch up with the performance full finetuning.
arXiv Detail & Related papers (2022-10-17T08:14:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.