RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy
- URL: http://arxiv.org/abs/2412.01129v2
- Date: Thu, 05 Dec 2024 05:05:01 GMT
- Title: RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy
- Authors: Geonho Lee, Janghwan Lee, Sukjin Hong, Minsoo Kim, Euijai Ahn, Du-Seong Chang, Jungwook Choi,
- Abstract summary: Low-rank adaptation (LoRA) has become the dominant method for parameter-efficient LLM fine-tuning.
LoRA-based quantization error compensation (LQEC) emerging as a powerful tool for recovering accuracy in compressed LLMs.
We propose RILQ (Rank-Insensitive LoRA-based Quantization Error Compensation) to understand fundamental limitation and boost 2-bit LLM accuracy.
- Score: 5.767260383077013
- License:
- Abstract: Low-rank adaptation (LoRA) has become the dominant method for parameter-efficient LLM fine-tuning, with LoRA-based quantization error compensation (LQEC) emerging as a powerful tool for recovering accuracy in compressed LLMs. However, LQEC has underperformed in sub-4-bit scenarios, with no prior investigation into understanding this limitation. We propose RILQ (Rank-Insensitive LoRA-based Quantization Error Compensation) to understand fundamental limitation and boost 2-bit LLM accuracy. Based on rank analysis revealing model-wise activation discrepancy loss's rank-insensitive nature, RILQ employs this loss to adjust adapters cooperatively across layers, enabling robust error compensation with low-rank adapters. Evaluations on LLaMA-2 and LLaMA-3 demonstrate RILQ's consistent improvements in 2-bit quantized inference across various state-of-the-art quantizers and enhanced accuracy in task-specific fine-tuning. RILQ maintains computational efficiency comparable to existing LoRA methods, enabling adapter-merged weight-quantized LLM inference with significantly enhanced accuracy, making it a promising approach for boosting 2-bit LLM performance.
Related papers
- Dynamic Low-Rank Sparse Adaptation for Large Language Models [54.1231638555233]
Low-rank Sparse Adaptation (LoSA) is a novel method that seamlessly integrates low-rank adaptation into sparse LLM sparsity.
LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning.
LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inferential burden.
arXiv Detail & Related papers (2025-02-20T18:37:32Z) - CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization [2.975939846457057]
Fine-tuning large language models (LLMs) using low-rank adaptation (LoRA) has become a highly efficient approach for downstream tasks.
Applying LoRA techniques to quantized LLMs poses unique challenges due to the reduced representational precision of quantized weights.
We introduce CLoQ, a simplistic initialization strategy designed to overcome these challenges.
arXiv Detail & Related papers (2025-01-30T16:48:15Z) - LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs.
We introduce LLM2, a novel framework that combines an LLM with a process-based verifier.
LLMs2 is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z) - LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization [78.93425154518705]
Low-rank adaption (LoRA) is a widely used parameter-efficient finetuning method for LLM that reduces memory requirements.
This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization.
arXiv Detail & Related papers (2024-10-27T22:57:12Z) - Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs [75.11449420928139]
Fine-tuning Large Language Models (LLMs) has become a crucial technique for adapting pre-trained models to downstream tasks.
Low-Rank Adaptation (LoRA) has emerged as a promising solution, but there exists a gap between the practical performance of low-rank adaptations and its theoretical optimum.
We propose eXtreme Gradient Boosting LoRA, a novel framework that bridges this gap by leveraging the power of ensemble learning.
arXiv Detail & Related papers (2024-10-25T17:07:13Z) - Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance [20.659750151408186]
Large Language Models (LLMs) have demonstrated impressive performance across various domains.
Existing solutions combine parameter quantization with Low-Rank Adaptation (LoRA)
We propose Quantized LLMs with Balanced-rank Adaptation (Q-BaRA) and Quantization-Aware Fine-tuning with Higher Rank Adaptation (QA-HiRA)
arXiv Detail & Related papers (2024-07-24T06:16:37Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs)
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially with below 4 bit-widths.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - Accurate LoRA-Finetuning Quantization of LLMs via Information Retention [21.925671783061542]
This paper proposes a novel IR-QLoRA for pushing quantized LLMs with LoRA to be highly accurate through information retention.
Experiments show that IR-QLoRA can significantly improve accuracy across LLaMA and LLaMA2 families under 2-4 bit-widths, e.g., 4- bit LLaMA-7B 1.4% improvement on MMLU.
arXiv Detail & Related papers (2024-02-08T06:53:31Z) - QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models [85.02796681773447]
We propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm.
The motivation lies in the imbalanced degrees of freedom of quantization and adaptation.
QA-LoRA is easily implemented with a few lines of code.
arXiv Detail & Related papers (2023-09-26T07:22:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.