Accurate LoRA-Finetuning Quantization of LLMs via Information Retention
- URL: http://arxiv.org/abs/2402.05445v2
- Date: Mon, 27 May 2024 09:20:35 GMT
- Title: Accurate LoRA-Finetuning Quantization of LLMs via Information Retention
- Authors: Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno,
- Abstract summary: This paper proposes a novel IR-QLoRA for pushing quantized LLMs with LoRA to be highly accurate through information retention.
Experiments show that IR-QLoRA can significantly improve accuracy across LLaMA and LLaMA2 families under 2-4 bit-widths, e.g., 4- bit LLaMA-7B 1.4% improvement on MMLU.
- Score: 21.925671783061542
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The LoRA-finetuning quantization of LLMs has been extensively studied to obtain accurate yet compact LLMs for deployment on resource-constrained hardware. However, existing methods cause the quantized LLM to severely degrade and even fail to benefit from the finetuning of LoRA. This paper proposes a novel IR-QLoRA for pushing quantized LLMs with LoRA to be highly accurate through information retention. The proposed IR-QLoRA mainly relies on two technologies derived from the perspective of unified information: (1) statistics-based Information Calibration Quantization allows the quantized parameters of LLM to retain original information accurately; (2) finetuning-based Information Elastic Connection makes LoRA utilizes elastic representation transformation with diverse information. Comprehensive experiments show that IR-QLoRA can significantly improve accuracy across LLaMA and LLaMA2 families under 2-4 bit-widths, e.g., 4- bit LLaMA-7B achieves 1.4% improvement on MMLU compared with the state-of-the-art methods. The significant performance gain requires only a tiny 0.31% additional time consumption, revealing the satisfactory efficiency of our IR-QLoRA. We highlight that IR-QLoRA enjoys excellent versatility, compatible with various frameworks (e.g., NormalFloat and Integer quantization) and brings general accuracy gains. The code is available at https://github.com/htqin/ir-qlora.
Related papers
- LightPROF: A Lightweight Reasoning Framework for Large Language Model on Knowledge Graph [57.382255728234064]
Large Language Models (LLMs) have impressive capabilities in text understanding and zero-shot reasoning.
Knowledge Graphs (KGs) provide rich and reliable contextual information for the reasoning process of LLMs.
We propose a novel Lightweight and efficient Prompt learning-ReasOning Framework for KGQA (LightPROF)
arXiv Detail & Related papers (2025-04-04T03:03:47Z) - Dynamic Low-Rank Sparse Adaptation for Large Language Models [54.1231638555233]
Low-rank Sparse Adaptation (LoSA) is a novel method that seamlessly integrates low-rank adaptation into sparse LLM sparsity.
LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning.
LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inferential burden.
arXiv Detail & Related papers (2025-02-20T18:37:32Z) - LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression.
LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model.
Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z) - RepLoRA: Reparameterizing Low-Rank Adaptation via the Perspective of Mixture of Experts [37.43961020113692]
Low-rank adaptation (LoRA) has emerged as a powerful method for fine-tuning large-scale foundation models.
This paper presents a theoretical analysis of LoRA by examining its connection to the Mixture of Experts models.
arXiv Detail & Related papers (2025-02-05T10:03:09Z) - CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization [2.975939846457057]
Fine-tuning large language models (LLMs) using low-rank adaptation (LoRA) has become a highly efficient approach for downstream tasks.
Applying LoRA techniques to quantized LLMs poses unique challenges due to the reduced representational precision of quantized weights.
We introduce CLoQ, a simplistic initialization strategy designed to overcome these challenges.
arXiv Detail & Related papers (2025-01-30T16:48:15Z) - RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy [5.767260383077013]
Low-rank adaptation (LoRA) has become the dominant method for parameter-efficient LLM fine-tuning.
LoRA-based quantization error compensation (LQEC) emerging as a powerful tool for recovering accuracy in compressed LLMs.
We propose RILQ (Rank-Insensitive LoRA-based Quantization Error Compensation) to understand fundamental limitation and boost 2-bit LLM accuracy.
arXiv Detail & Related papers (2024-12-02T05:09:56Z) - V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM [32.37720746437661]
Low-rank adaptation (LoRA) offers a promising method to integrate external knowledge into large language models (LMMs)
Existing LoRA model serving is excessively computationally expensive and causes extremely high latency.
We present an end-to-end solution that empowers diverse vision tasks and enriches vision applications with LoRA LMMs.
arXiv Detail & Related papers (2024-11-01T13:43:33Z) - LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization [78.93425154518705]
Low-rank adaption (LoRA) is a widely used parameter-efficient finetuning method for LLM that reduces memory requirements.
This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization.
arXiv Detail & Related papers (2024-10-27T22:57:12Z) - RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization [38.23587031169402]
We propose RoLoRA, the first LoRA-based scheme for effective weight-activation quantization.
We evaluate RoLoRA across LLaMA2-7B/13B, LLaMA3-8B models, achieving up to 29.5% absolute accuracy gain of 4-bit weight-activation quantized LLaMA2- 13B.
arXiv Detail & Related papers (2024-07-10T20:52:18Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs)
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially with below 4 bit-widths.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - Enabling Weak LLMs to Judge Response Reliability via Meta Ranking [38.63721941742435]
We propose a novel cross-query-comparison-based method called $textitMeta Ranking$ (MR)
MR assesses reliability by pairwisely ranking the target query-response pair with multiple reference query-response pairs.
We show that MR can enhance strong LLMs' performance in two practical applications: model cascading and instruction tuning.
arXiv Detail & Related papers (2024-02-19T13:57:55Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models [85.02796681773447]
We propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm.
The motivation lies in the imbalanced degrees of freedom of quantization and adaptation.
QA-LoRA is easily implemented with a few lines of code.
arXiv Detail & Related papers (2023-09-26T07:22:23Z) - CA-LoRA: Adapting Existing LoRA for Compressed LLMs to Enable Efficient Multi-Tasking on Personal Devices [78.16679232748196]
We introduce a Compression-Aware LoRA (CA-LoRA) framework to transfer Large Language Models (LLMs) to other tasks.
Experiment results demonstrate that CA-LoRA outperforms the vanilla LoRA methods applied to a compressed LLM.
The source code of CA-LoRA is available at https://github.com/thunlp/CA-LoRA.
arXiv Detail & Related papers (2023-07-15T04:37:11Z) - LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaption (LoRA) has emerged to fine-tune large language models (LLMs)
LoRAPrune is a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner.
LoRAPrune achieves a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
arXiv Detail & Related papers (2023-05-28T15:15:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.