QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
- URL: http://arxiv.org/abs/2310.07147v1
- Date: Wed, 11 Oct 2023 02:47:40 GMT
- Title: QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
- Authors: Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Qingyi Gu, Kurt
Keutzer
- Abstract summary: Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks.
Fine-tuning these pre-trained models on downstream datasets provides further significant performance gains, but this process has been challenging due to its extraordinary resource requirements.
We propose QFT, a novel Quantized Full-parameter Tuning framework for LLMs that enables memory-efficient fine-tuning without harming performance.
- Score: 37.265708531464746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have showcased remarkable impacts across a wide
spectrum of natural language processing tasks. Fine-tuning these pre-trained
models on downstream datasets provides further significant performance gains,
but this process has been challenging due to its extraordinary resource
requirements. To this end, existing efforts focus on parameter-efficient
fine-tuning, which, unfortunately, fails to capitalize on the powerful potential
of full-parameter fine-tuning. In this work, we propose QFT, a novel Quantized
Full-parameter Tuning framework for LLMs that enables memory-efficient
fine-tuning without harming performance. Our framework incorporates two novel
ideas: (i) we adopt the efficient Lion optimizer, which only keeps track of the
momentum and has consistent update magnitudes for each parameter, an inherent
advantage for robust quantization; and (ii) we quantize all model states and
store them as integer values, and present a gradient flow and parameter update
scheme for the quantized weights. As a result, QFT reduces the model state
memory to 21% of the standard solution while achieving comparable performance,
e.g., tuning a LLaMA-7B model requires only <30GB of memory, which can be satisfied by a
single A6000 GPU.
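
To make the abstract's two ideas concrete, the sketch below pairs a Lion-style sign update with integer storage of the two model states (weights and momentum). This is a minimal illustration, assuming a symmetric per-tensor int8 quantizer; the helper names and hyperparameters are illustrative, and the paper's actual gradient-flow and update scheme for quantized weights is more involved.

```python
import numpy as np

def quantize(x, bits=8):
    """Symmetric per-tensor quantization to signed integers (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-12
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def lion_step_quantized(w_q, w_scale, m_q, m_scale, grad,
                        lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update where weights and momentum are stored as int8 states.

    Lion only tracks momentum, and its update is sign-based, so every
    parameter moves by the same magnitude (lr) -- the property the abstract
    cites as an inherent advantage for robust quantization.
    """
    w = dequantize(w_q, w_scale)          # restore a float view of the weights
    m = dequantize(m_q, m_scale)          # restore a float view of the momentum

    update = np.sign(beta1 * m + (1 - beta1) * grad)   # sign-based step
    w = w - lr * (update + wd * w)                     # parameter update
    m = beta2 * m + (1 - beta2) * grad                 # momentum update

    # Re-quantize both model states back to integers for storage.
    w_q, w_scale = quantize(w)
    m_q, m_scale = quantize(m)
    return w_q, w_scale, m_q, m_scale

# Toy usage: one step on a random weight matrix.
rng = np.random.default_rng(0)
w_q, w_scale = quantize(rng.normal(size=(4, 4)).astype(np.float32))
m_q, m_scale = quantize(np.zeros((4, 4), dtype=np.float32))
grad = rng.normal(size=(4, 4)).astype(np.float32)
w_q, w_scale, m_q, m_scale = lion_step_quantized(w_q, w_scale, m_q, m_scale, grad)
```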
Related papers
- Hyper Compressed Fine-Tuning of Large Foundation Models with Quantum Inspired Adapters [0.0]
We propose Quantum-Inspired Adapters, a PEFT approach inspired by Hamming-weight quantum circuits from the quantum machine learning literature.
We test our proposed adapters by adapting large language models and large vision transformers on benchmark datasets.
arXiv Detail & Related papers (2025-02-10T13:06:56Z)
- Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models.
High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size.
We propose Sparse Gradient Compression (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z)
- FineGates: LLMs Finetuning with Compression using Stochastic Gates [7.093692674858257]
Large Language Models (LLMs) present significant challenges for full finetuning due to the high computational demands.
Lightweight finetuning techniques have been proposed, such as learning low-rank adapter layers.
We propose an adapter model based on stochastic gates that simultaneously sparsifies the frozen base model and provides task-specific adaptation.
arXiv Detail & Related papers (2024-12-17T14:33:05Z)
- QEFT: Quantization for Efficient Fine-Tuning of LLMs [9.446971590056945]
We propose a new technique called Quantization for Efficient Fine-Tuning (QEFT).
QEFT accelerates both inference and fine-tuning, is supported by robust theoretical foundations, and maintains good hardware compatibility.
Our experiments demonstrate that QEFT matches the quality and versatility of full-precision parameter-efficient fine-tuning.
arXiv Detail & Related papers (2024-10-11T09:39:33Z)
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss.
We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm.
EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
arXiv Detail & Related papers (2024-07-10T17:53:30Z)
- Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity [66.67596152389591]
Zeroth-order optimization (ZO) is a memory-efficient strategy for fine-tuning Large Language Models.
In this study, we investigate the feasibility of fine-tuning an extremely small subset of LLM parameters using ZO.
Our results demonstrate that fine-tuning 0.1% of the sensitive parameters in the LLM with ZO can outperform full ZO fine-tuning (a minimal sketch of the zeroth-order gradient estimate appears after this list).
arXiv Detail & Related papers (2024-06-05T04:07:35Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z)
- Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning [126.84770886628833]
Existing finetuning methods either tune all parameters of the pretrained model (full finetuning) or only tune the last linear layer (linear probing).
We propose a new parameter-efficient finetuning method, termed SSF, in which one only needs to Scale and Shift the deep Features extracted by a pre-trained model to match the performance of full finetuning (sketched below).
arXiv Detail & Related papers (2022-10-17T08:14:49Z)
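
As a concrete illustration of the SSF entry above, here is a minimal sketch of scale-and-shift feature modulation on top of a frozen layer, assuming a simple linear projection; the class name, shapes, and initialization are illustrative rather than the paper's implementation.

```python
import numpy as np

class SSFLayer:
    """Scale-and-shift modulation of a frozen layer's output features.

    Only gamma (scale) and beta (shift) are trainable; the frozen weight W
    stays untouched, which is what makes the method parameter-efficient.
    """
    def __init__(self, frozen_weight):
        self.W = frozen_weight                          # frozen pretrained weight, shape (d_in, d_out)
        d_out = frozen_weight.shape[1]
        self.gamma = np.ones(d_out, dtype=np.float32)   # trainable per-channel scale, init to 1
        self.beta = np.zeros(d_out, dtype=np.float32)   # trainable per-channel shift, init to 0

    def forward(self, x):
        features = x @ self.W                           # frozen feature extraction
        return self.gamma * features + self.beta        # per-channel scale and shift

# Toy usage: modulate features of a random frozen projection.
rng = np.random.default_rng(0)
layer = SSFLayer(rng.normal(size=(16, 8)).astype(np.float32))
out = layer.forward(rng.normal(size=(4, 16)).astype(np.float32))
print(out.shape)  # (4, 8)
```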
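
For the zeroth-order fine-tuning entry above, the sketch below shows a standard two-point (SPSA-style) gradient estimate restricted to a small mask of parameters, which is the general mechanism behind ZO's memory efficiency; the loss, mask, and step sizes are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def zo_step(theta, loss_fn, mask, eps=1e-3, lr=1e-2, rng=None):
    """One zeroth-order update: estimate the gradient from two loss
    evaluations along a random direction, touching only masked parameters.

    No backpropagation is needed, so activations and per-parameter gradients
    never have to be stored -- the source of ZO's memory efficiency.
    """
    if rng is None:
        rng = np.random.default_rng()
    z = rng.standard_normal(theta.shape) * mask            # random direction on the sparse subset
    loss_plus = loss_fn(theta + eps * z)                    # forward pass at theta + eps*z
    loss_minus = loss_fn(theta - eps * z)                   # forward pass at theta - eps*z
    grad_est = (loss_plus - loss_minus) / (2 * eps) * z     # two-point gradient estimate
    return theta - lr * grad_est

# Toy usage: minimize a quadratic while updating only 10% of the parameters.
rng = np.random.default_rng(0)
theta = rng.normal(size=100)
mask = (rng.random(100) < 0.1).astype(np.float64)           # "sensitive" subset (illustrative)
loss = lambda t: np.sum(t ** 2)
for _ in range(200):
    theta = zo_step(theta, loss, mask, rng=rng)
print(round(loss(theta), 4))  # only the masked 10% of parameters were updated
```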