RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization
- URL: http://arxiv.org/abs/2407.08044v2
- Date: Thu, 26 Sep 2024 23:47:03 GMT
- Title: RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization
- Authors: Xijie Huang, Zechun Liu, Shih-Yang Liu, Kwang-Ting Cheng,
- Abstract summary: We propose RoLoRA, the first LoRA-based scheme for effective weight-activation quantization.
We evaluate RoLoRA across LLaMA2-7B/13B, LLaMA3-8B models, achieving up to 29.5% absolute accuracy gain of 4-bit weight-activation quantized LLaMA2- 13B.
- Score: 38.23587031169402
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Low-Rank Adaptation (LoRA), as a representative Parameter-Efficient Fine-Tuning (PEFT)method, significantly enhances the training efficiency by updating only a small portion of the weights in Large Language Models (LLMs). Recently, weight-only quantization techniques have also been applied to LoRA methods to reduce the memory footprint of fine-tuning. However, applying weight-activation quantization to the LoRA pipeline is under-explored, and we observe substantial performance degradation primarily due to the presence of activation outliers. In this work, we propose RoLoRA, the first LoRA-based scheme for effective weight-activation quantization. RoLoRA utilizes rotation for outlier elimination and proposes rotation-aware fine-tuning to preserve the outlier-free characteristics in rotated LLMs. Experimental results show RoLoRA consistently improves low-bit LoRA convergence and post-training quantization robustness in weight-activation settings. We evaluate RoLoRA across LLaMA2-7B/13B, LLaMA3-8B models, achieving up to 29.5% absolute accuracy gain of 4-bit weight-activation quantized LLaMA2- 13B on commonsense reasoning tasks compared to LoRA baseline. We further demonstrate its effectiveness on Large Multimodal Models (LLaVA-1.5-7B). Codes are available at https://github.com/HuangOwen/RoLoRA
Related papers
- LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization [78.93425154518705]
Low-rank adaption (LoRA) is a widely used parameter-efficient finetuning method for LLM that reduces memory requirements.
This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization.
arXiv Detail & Related papers (2024-10-27T22:57:12Z) - Randomized Asymmetric Chain of LoRA: The First Meaningful Theoretical Framework for Low-Rank Adaptation [58.288682735160585]
Low-Rank Adaptation (LoRA) is a popular technique for finetuning models.
LoRA often under performs when compared to full- parameter fine-tuning.
We present a framework that rigorously analyzes the adaptation rates of LoRA methods.
arXiv Detail & Related papers (2024-10-10T18:51:53Z) - Bone: Block-Affine Adaptation of Large Language Models [0.0]
Low-Rank Adaptation (LoRA) has achieved remarkable training results by freezing the original weights and training only low-rank matrices.
This paper introduces a novel PEFT technique distinct from LoRA, called Block-Affine Adaptation (Bone)
Bone significantly reduces memory usage and achieves faster computation.
arXiv Detail & Related papers (2024-09-19T10:26:42Z) - ResLoRA: Identity Residual Mapping in Low-Rank Adaption [96.59370314485074]
We propose ResLoRA, an improved framework of low-rank adaptation (LoRA)
Our method can achieve better results in fewer training steps without any extra trainable parameters or inference cost compared to LoRA.
The experiments on NLG, NLU, and text-to-image tasks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2024-02-28T04:33:20Z) - DoRA: Weight-Decomposed Low-Rank Adaptation [57.68678247436207]
We introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA.
Aiming to resemble the learning capacity of FT from the findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA)
DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning.
arXiv Detail & Related papers (2024-02-14T17:59:34Z) - LoRA-drop: Efficient LoRA Parameter Pruning based on Output Evaluation [27.123271324468657]
Low-Rank Adaptation (LoRA) is currently the most commonly used.
efficient fine-tuning (PEFT) method.
It introduces auxiliary parameters for each layer to fine-tune the pre-trained model under limited computing resources.
However, it still faces resource consumption challenges when scaling up to larger models.
arXiv Detail & Related papers (2024-02-12T15:34:56Z) - LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models
Fine-tuning [19.08716369943138]
We present LoRA-FA, a memory-efficient fine-tuning method that reduces the activation memory without performance degradation and expensive recomputation.
Our results show that LoRA-FA can always achieve close fine-tuning accuracy across different tasks compared to full parameter fine-tuning and LoRA.
arXiv Detail & Related papers (2023-08-07T05:12:27Z) - LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaption (LoRA) has emerged to fine-tune large language models (LLMs)
LoRAPrune is a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner.
LoRAPrune achieves a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
arXiv Detail & Related papers (2023-05-28T15:15:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.