L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models
- URL: http://arxiv.org/abs/2402.04902v3
- Date: Wed, 22 May 2024 20:23:54 GMT
- Title: L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models
- Authors: Hyesung Jeon, Yulhwa Kim, Jae-joon Kim
- Abstract summary: Post-training quantization (PTQ) is more commonly used in prior work than quantization-aware training (QAT).
By design, L4Q allows quantization parameters to reflect weight updates, while weight updates reduce quantization errors.
Our experiments demonstrate that this coupled quantization and fine-tuning approach yields superior accuracy compared to decoupled fine-tuning schemes in sub-4-bit quantization.
- Score: 5.304907804008533
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the high memory and computational costs associated with Large Language Models, model compression via quantization and parameter-efficient fine-tuning (PEFT) methods, such as low-rank adaptation (LoRA), are gaining popularity. This has led to active research on quantization-aware PEFT techniques, which aim to create models with high accuracy and low memory overhead. Among quantization methods, post-training quantization (PTQ) is more commonly used in previous works than quantization-aware training (QAT), despite QAT's potential for higher accuracy. This preference is due to PTQ's low training overhead. However, PTQ-based PEFT methods often utilize high-precision parameters, making it difficult to fully exploit the efficiency of quantization. Additionally, they have limited adaptation ability due to a reduced and constrained LoRA parameter structure. To overcome these challenges, we propose L4Q, which leverages joint quantization and fine-tuning to reduce QAT's memory overhead and produce models that consist entirely of quantized weights while achieving effective adaptation to downstream tasks. By design, L4Q allows quantization parameters to reflect weight updates, while weight updates reduce quantization errors. Our experiments demonstrate that this coupled quantization and fine-tuning approach yields superior accuracy compared to decoupled fine-tuning schemes in sub-4-bit quantization. Using the LLaMA model families and instructional datasets, we showcase L4Q's capabilities in language tasks and few-shot in-context learning.
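A minimal PyTorch sketch of the coupling described in the abstract (our own illustration, not the authors' released code; the layer name, shapes, and per-channel step-size parameterization are assumptions): the LoRA update is added to the frozen base weight before fake quantization, and the quantization step size is itself a learnable parameter, so the task loss updates the adapters and the quantizer together.

```python
import torch
import torch.nn as nn


class L4QLinearSketch(nn.Module):
    """Minimal sketch of joint quantization and LoRA fine-tuning (assumed; not the official L4Q code)."""

    def __init__(self, in_features, out_features, rank=8, n_bits=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)          # frozen pre-trained weight
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        # Learnable per-channel step size: quantization parameters see every weight update.
        self.scale = nn.Parameter(
            self.weight.abs().max(dim=1, keepdim=True).values / (2 ** (n_bits - 1) - 1))
        self.qmin, self.qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1

    def forward(self, x):
        # The LoRA update is folded into the weight *before* quantization, so the
        # deployed model consists entirely of quantized weights.
        w = self.weight + self.lora_b @ self.lora_a
        w_s = torch.clamp(w / self.scale, self.qmin, self.qmax)
        # Straight-through estimator: rounding acts as identity in the backward pass,
        # so the task loss updates lora_a, lora_b, and scale jointly.
        w_q = (torch.round(w_s) - w_s).detach() + w_s
        return x @ (w_q * self.scale).t()


layer = L4QLinearSketch(16, 32)
layer(torch.randn(4, 16)).sum().backward()   # gradients reach lora_a, lora_b, and scale
```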
Related papers
- LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices [41.17378536966264]
Low-Rank Quantization (LRQ) is a simple yet effective post-training weight quantization method for large language models.
Thanks to parameter sharing via low-rank structure, LRQ only needs to learn significantly fewer parameters while enabling the individual scaling of weights.
We show the superiority of LRQ over prior LLM PTQ works under (i) $8$-bit weight and per-tensor activation quantization, (ii) $4$-bit weight and $8$-bit per-token activation quantization, and (iii) low-bit weight-only quantization schemes.
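Read loosely, the scaling idea could look like the sketch below (a hypothetical illustration; LRQ's actual parameterization and objective may differ): the per-weight scales are the product of two low-rank factors, so only (out + in) x rank values need to be learned instead of one scale per weight.

```python
import torch


def low_rank_scaled_quantize(w, rank=4, n_bits=4):
    """Hypothetical sketch: per-weight scaling via a low-rank matrix, followed by
    uniform quantization. The factors u and v would normally be optimized to
    minimize layer reconstruction error; here they are only initialized."""
    out_f, in_f = w.shape
    u = torch.ones(out_f, rank, requires_grad=True)          # (out_f + in_f) * rank learnable values
    v = torch.full((rank, in_f), 1.0 / rank, requires_grad=True)
    s = u @ v                                                # expands to a full (out_f, in_f) scaling matrix
    qmax = 2 ** (n_bits - 1) - 1
    step = w.abs().max() / qmax                              # shared base step size, for illustration
    q = torch.clamp(torch.round(w * s / step), -qmax - 1, qmax)
    return q * step / s                                      # dequantize and undo the per-weight scaling


w_hat = low_rank_scaled_quantize(torch.randn(32, 64))
```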
arXiv Detail & Related papers (2024-07-16T09:32:07Z) - EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [62.904403513409484]
Large language models (LLMs) are integral to modern natural language processing and artificial intelligence.
We propose Efficient Quantization-Aware Training (EfficientQAT), a novel quantization technique for compressing LLMs.
Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models.
arXiv Detail & Related papers (2024-07-10T17:53:30Z) - Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization [62.15918574997175]
Language models are known to contain outlier channels whose values are, on average, orders of magnitude larger than those of other channels.
We propose a strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization.
We show that regularizing both the inputs and outputs is crucial for preventing a model from "migrating" the difficulty of input quantization to the weights.
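A hedged sketch of the output-side regularizer (the exact loss and target in the paper may differ): penalize deviation of a layer's output-activation kurtosis from a target value so the distribution stays friendly to quantization.

```python
import torch


def kurtosis_penalty(activations, target=3.0):
    """Penalize deviation of the activations' kurtosis from a target value.
    target=3.0 (Gaussian) and the squared penalty are assumptions, not taken
    from the abstract."""
    x = activations.flatten()
    mu = x.mean()
    sigma = x.std(unbiased=False) + 1e-8
    kurt = ((x - mu) ** 4).mean() / sigma ** 4
    return (kurt - target) ** 2


# Usage sketch during quantization-aware training:
# loss = task_loss + lambda_kurt * kurtosis_penalty(layer_output)
```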
arXiv Detail & Related papers (2024-04-04T17:25:30Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
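For illustration, a generic asymmetric uniform quantizer of the kind the abstract refers to is sketched below (this is not WKVQuant's specific scheme for weights and the KV cache, which is more involved):

```python
import torch


def quantize_asymmetric(t, n_bits=4):
    """Generic asymmetric uniform quantization to low-bit integers, applicable
    e.g. to a cached key/value tensor; a sketch only, not WKVQuant's method."""
    qmin, qmax = 0, 2 ** n_bits - 1
    t_min, t_max = t.min(), t.max()
    scale = (t_max - t_min).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-t_min / scale)
    q = torch.clamp(torch.round(t / scale) + zero_point, qmin, qmax).to(torch.uint8)
    dequant = (q.float() - zero_point) * scale    # what the model sees at inference time
    return q, scale, zero_point, dequant
```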
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs.
We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs.
CBQ employs a cross-block dependency using a reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation.
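A hedged sketch of the cross-block reconstruction idea (an assumed reading; CBQ's exact objective and dependency scheme may differ): the same input is propagated through a window of consecutive full-precision and quantized blocks, and outputs are compared only at the end of the window, so errors that accumulate across blocks are minimized jointly rather than block by block.

```python
import torch
import torch.nn.functional as F


def cross_block_reconstruction_loss(fp_blocks, q_blocks, x, window=2):
    """Sketch: reconstruction loss over windows of consecutive blocks (assumed)."""
    losses = []
    for start in range(0, len(fp_blocks), window):
        fp_out, q_out = x, x
        for fp_b, q_b in zip(fp_blocks[start:start + window],
                             q_blocks[start:start + window]):
            fp_out = fp_b(fp_out)
            q_out = q_b(q_out)
        losses.append(F.mse_loss(q_out, fp_out))
        x = fp_out.detach()   # feed the clean full-precision output to the next window
    return sum(losses)
```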
arXiv Detail & Related papers (2023-12-13T07:56:27Z) - Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes it more practical for real-world applications.
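The plugin idea might be sketched as follows (an assumption based on the abstract: all weights of the already-quantized model stay frozen and only normalization-layer parameters are briefly tuned against full-precision outputs):

```python
import torch


def tweak_norms(quant_model, fp_model, calib_batches, lr=1e-5, steps=10):
    """Sketch (assumed): update only normalization parameters of a quantized model
    so its outputs move back toward those of the full-precision model."""
    for p in quant_model.parameters():
        p.requires_grad_(False)
    norm_params = []
    for m in quant_model.modules():
        if isinstance(m, torch.nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad_(True)
                norm_params.append(p)
    opt = torch.optim.Adam(norm_params, lr=lr)
    for _ in range(steps):
        for batch in calib_batches:
            with torch.no_grad():
                target = fp_model(batch)
            loss = torch.nn.functional.mse_loss(quant_model(batch), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return quant_model
```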
arXiv Detail & Related papers (2023-09-06T06:51:15Z) - OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an omnidirectionally calibrated quantization technique (OmniQuant) for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z) - PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
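A hedged sketch of the quantize-before-fine-tuning flow with outlier-aware fine-tuning (the magnitude-based selection below is our assumption, not necessarily the paper's criterion): quantize first, then leave a small fraction of outlier weights in full precision and fine-tune only those on the downstream task.

```python
import torch


def build_outlier_mask(weight, fraction=0.001):
    """Mark the largest-magnitude weights as outliers (assumed criterion). These
    stay in full precision and are fine-tuned after quantization."""
    k = max(1, int(weight.numel() * fraction))
    threshold = weight.abs().flatten().topk(k).values.min()
    return weight.abs() >= threshold


# Usage sketch: quantize the non-outlier entries first, then fine-tune only the
# outliers by masking the gradient of everything else.
# mask = build_outlier_mask(layer.weight)
# layer.weight.register_hook(lambda grad: grad * mask)
```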
arXiv Detail & Related papers (2023-05-30T08:41:33Z) - Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization [27.79783067245817]
Large language models (LLMs) face challenges in fine-tuning and deployment due to their high memory demands and computational costs.
This paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA), a simple yet effective method that combines the advantages of PEFT with quantized LLMs.
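PEQA is commonly described as fine-tuning only the quantization scales of a sub-4-bit model while the integer weights stay frozen; under that reading, a minimal sketch looks like this (hypothetical class, not the authors' code):

```python
import torch
import torch.nn as nn


class ScaleOnlyQuantLinear(nn.Module):
    """Sketch: frozen low-bit integer weights, trainable per-channel scales only."""

    def __init__(self, w_int, scale, zero_point):
        super().__init__()
        self.register_buffer("w_int", w_int)             # (out, in) integer codes, frozen
        self.register_buffer("zero_point", zero_point)   # (out, 1), frozen
        self.scale = nn.Parameter(scale)                 # (out, 1), the only trained tensor

    def forward(self, x):
        w = (self.w_int.float() - self.zero_point) * self.scale
        return x @ w.t()
```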
arXiv Detail & Related papers (2023-05-23T15:20:01Z) - QFT: Post-training quantization via fast joint finetuning of all degrees of freedom [1.1744028458220428]
We rethink quantized network parameterization in a hardware-aware fashion, towards a unified analysis of all quantization degrees of freedom (DoF).
Our simple and extendable single-step method, dubbed quantization-aware finetuning (QFT), achieves 4-bit weight quantization results on par with the state of the art.
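A hedged sketch of treating all quantization degrees of freedom as jointly trainable in a single pass (an assumption; QFT's hardware-aware parameterization is not reproduced here): step size, zero point, and the underlying weight are all parameters, with straight-through rounding.

```python
import torch
import torch.nn as nn


class AllDoFFakeQuant(nn.Module):
    """Sketch (assumed): weight, step size, and zero point trained together."""

    def __init__(self, weight, n_bits=4):
        super().__init__()
        self.weight = nn.Parameter(weight.detach().clone())
        self.scale = nn.Parameter(weight.abs().max().reshape(1) / (2 ** (n_bits - 1) - 1))
        self.zero_point = nn.Parameter(torch.zeros(1))
        self.qmin, self.qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1

    def forward(self):
        w = torch.clamp(self.weight / self.scale + self.zero_point, self.qmin, self.qmax)
        w_q = (torch.round(w) - w).detach() + w        # straight-through rounding
        return (w_q - self.zero_point) * self.scale
```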
arXiv Detail & Related papers (2022-12-05T22:38:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.