LoRA-GA: Low-Rank Adaptation with Gradient Approximation
- URL: http://arxiv.org/abs/2407.05000v2
- Date: Tue, 16 Jul 2024 07:32:23 GMT
- Title: LoRA-GA: Low-Rank Adaptation with Gradient Approximation
- Authors: Shaowen Wang, Linxi Yu, Jian Li,
- Abstract summary: Fine-tuning large-scale pretrained models is prohibitively expensive in terms of computational and memory costs.
LoRA offers a cost-effective alternative by fine-tuning an auxiliary low-rank model that has significantly fewer parameters.
LoRA converges at a considerably slower rate compared to full fine-tuning, leading to increased overall compute and often worse test performance.
- Score: 5.685201910521295
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Fine-tuning large-scale pretrained models is prohibitively expensive in terms of computational and memory costs. LoRA, as one of the most popular Parameter-Efficient Fine-Tuning (PEFT) methods, offers a cost-effective alternative by fine-tuning an auxiliary low-rank model that has significantly fewer parameters. Although LoRA reduces the computational and memory requirements significantly at each iteration, extensive empirical evidence indicates that it converges at a considerably slower rate compared to full fine-tuning, ultimately leading to increased overall compute and often worse test performance. In our paper, we perform an in-depth investigation of the initialization method of LoRA and show that careful initialization (without any change of the architecture and the training algorithm) can significantly enhance both efficiency and performance. In particular, we introduce a novel initialization method, LoRA-GA (Low Rank Adaptation with Gradient Approximation), which aligns the gradients of low-rank matrix product with those of full fine-tuning at the first step. Our extensive experiments demonstrate that LoRA-GA achieves a convergence rate comparable to that of full fine-tuning (hence being significantly faster than vanilla LoRA as well as various recent improvements) while simultaneously attaining comparable or even better performance. For example, on the subset of the GLUE dataset with T5-Base, LoRA-GA outperforms LoRA by 5.69% on average. On larger models such as Llama 2-7B, LoRA-GA shows performance improvements of 0.34, 11.52%, and 5.05% on MT-bench, GSM8K, and Human-eval, respectively. Additionally, we observe up to 2-4 times convergence speed improvement compared to vanilla LoRA, validating its effectiveness in accelerating convergence and enhancing model performance. Code is available at https://github.com/Outsider565/LoRA-GA.
Related papers
- Unlocking the Global Synergies in Low-Rank Adapters [20.32980343066711]
Low-rank Adaption (LoRA) has been the de-facto parameter-efficient fine-tuning technique for large language models.
We present HeteroLoRA, a light-weight search algorithm that leverages zero-cost proxies to allocate the limited LoRA trainable parameters.
Experiments show that HeteroLoRA enables improvements in model performance given the same parameter budge.
arXiv Detail & Related papers (2024-06-21T08:10:03Z) - ResLoRA: Identity Residual Mapping in Low-Rank Adaption [96.59370314485074]
We propose ResLoRA, an improved framework of low-rank adaptation (LoRA)
Our method can achieve better results in fewer training steps without any extra trainable parameters or inference cost compared to LoRA.
The experiments on NLG, NLU, and text-to-image tasks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2024-02-28T04:33:20Z) - PeriodicLoRA: Breaking the Low-Rank Bottleneck in LoRA Optimization [39.30090456724925]
Supervised fine-tuning is the most common method to adapt large language models (LLMs) to downstream tasks.
Full fine-tuning requires massive computational resources.
LoRA is one of the most widely used methods, which assumes that the optimization process is essentially low-dimensional.
arXiv Detail & Related papers (2024-02-25T16:43:41Z) - LoRA+: Efficient Low Rank Adaptation of Large Models [13.074320303580361]
We show that Low Rank Adaptation (LoRA) leads to suboptimal finetuning of models with large width (embedding dimension)
We then show that this suboptimality of LoRA can be corrected simply by setting different learning rates for the LoRA adapter matrices A and B with a well-chosen ratio.
In our experiments, LoRA$+$ improves performance (1-2 $%$ improvements) and finetuning speed (up to $sim$ 2X SpeedUp) at the same computational cost as LoRA.
arXiv Detail & Related papers (2024-02-19T18:33:49Z) - DoRA: Weight-Decomposed Low-Rank Adaptation [57.68678247436207]
We introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA.
Aiming to resemble the learning capacity of FT from the findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA)
DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning.
arXiv Detail & Related papers (2024-02-14T17:59:34Z) - Chain of LoRA: Efficient Fine-tuning of Language Models via Residual
Learning [31.036465632204663]
We introduce Chain of LoRA, an iterative optimization framework inspired by the Frank-Wolfe algorithm.
We demonstrate that COLA can consistently outperform LoRA without additional computational or memory costs.
arXiv Detail & Related papers (2024-01-08T14:26:49Z) - Run LoRA Run: Faster and Lighter LoRA Implementations [50.347242693025336]
LoRA is a technique that reduces the number of trainable parameters in a neural network by introducing low-rank adapters to linear layers.
This paper presents the RunLoRA framework for efficient implementations of LoRA.
Experiments show up to 28% speedup on language modeling networks.
arXiv Detail & Related papers (2023-12-06T10:54:34Z) - Sparse Low-rank Adaptation of Pre-trained Language Models [79.74094517030035]
We introduce sparse low-rank adaptation (SoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process.
Our approach strengthens the representation power of LoRA by initializing it with a higher rank, while efficiently taming a temporarily increased number of parameters.
Our experimental results demonstrate that SoRA can outperform other baselines even with 70% retained parameters and 70% training time.
arXiv Detail & Related papers (2023-11-20T11:56:25Z) - LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models
Fine-tuning [19.08716369943138]
We present LoRA-FA, a memory-efficient fine-tuning method that reduces the activation memory without performance degradation and expensive recomputation.
Our results show that LoRA-FA can always achieve close fine-tuning accuracy across different tasks compared to full parameter fine-tuning and LoRA.
arXiv Detail & Related papers (2023-08-07T05:12:27Z) - LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaption (LoRA) has emerged to fine-tune large language models (LLMs)
LoRAPrune is a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner.
LoRAPrune achieves a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
arXiv Detail & Related papers (2023-05-28T15:15:48Z) - LoRA: Low-Rank Adaptation of Large Language Models [71.75808607987281]
Low-Rank Adaptation, or LoRA, freezes the pre-trained model weights and injects trainable rank decomposition into each layer of the Transformer architecture.
For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the computation hardware requirement by 3 times compared to full fine-tuning.
arXiv Detail & Related papers (2021-06-17T17:37:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.