Related papers: SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information

URL: http://arxiv.org/abs/2406.06564v2
Date: Thu, 10 Oct 2024 07:03:59 GMT
Title: SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information
Authors: Kaiye Zhou, Shucheng Wang, Jun Xu
Abstract summary: SwitchLoRA is a parameter-efficient training technique that frequently and smoothly replaces the trainable parameters of LoRA with alternative parameters. Our results demonstrate that SwitchLoRA actually surpasses full-rank training.
Score: 3.6859322366469933
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the training of large language models, parameter-efficient techniques such as LoRA optimize memory usage and reduce communication overhead during the fine-tuning phase. However, applying such techniques directly during the pre-training phase results in poor performance, primarily because the premature implementation of low-rank training significantly reduces model accuracy. Existing methods like ReLoRA and GaLore have attempted to address this challenge by updating the low-rank subspace. However, they still fall short of achieving the accuracy of full-rank training because they must limit the update frequency to maintain optimizer state consistency, hindering their ability to closely approximate full-rank training behavior. In this paper, we introduce SwitchLoRA, a parameter-efficient training technique that frequently and smoothly replaces the trainable parameters of LoRA adapters with alternative parameters. SwitchLoRA updates the low-rank subspace incrementally, targeting only a few dimensions at a time to minimize the impact on optimizer states. This allows a higher update frequency, thereby enhancing accuracy by enabling the updated parameters to more closely mimic full-rank behavior during the pre-training phase. Our results demonstrate that SwitchLoRA actually surpasses full-rank training, reducing perplexity from 15.23 to 15.01 on the LLaMA 1.3B model while also reducing communication overhead by 54%. Furthermore, after fully fine-tuning both the SwitchLoRA pre-trained model and the full-rank pre-trained model on the GLUE benchmark, the SwitchLoRA pre-trained model showed an average accuracy gain of about 1% over the full-rank pre-trained model. This demonstrates enhanced generalization and reasoning capabilities of SwitchLoRA.
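
To make the idea of replacing only a few adapter dimensions at a time more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names (`effective_weight`, `switch_column`) and the compensation rule (folding the difference into the frozen weight so the layer's effective weight is unchanged by the swap) are assumptions chosen for illustration, and the abstract does not specify them.

```python
import random

def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def effective_weight(W, B, A):
    """Weight the layer actually applies: base W plus the low-rank product B @ A."""
    BA = matmul(B, A)
    return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, BA)]

def switch_column(W, B, A, k, candidate):
    """Replace column k of B with `candidate`, in place.

    Assumed compensation rule (hypothetical, for illustration only):
    add (old_column - candidate) outer A[k] to W, so that W + B @ A
    is identical before and after the swap.
    """
    for i in range(len(W)):
        delta = B[i][k] - candidate[i]
        for j in range(len(W[0])):
            W[i][j] += delta * A[k][j]
        B[i][k] = candidate[i]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Small demo: a 4x5 layer with a rank-2 adapter.
rng = random.Random(0)
W = rand_matrix(4, 5, rng)   # base weight (m x n)
B = rand_matrix(4, 2, rng)   # adapter factor (m x r)
A = rand_matrix(2, 5, rng)   # adapter factor (r x n)

before = effective_weight(W, B, A)
new_col = [rng.gauss(0.0, 1.0) for _ in range(4)]
switch_column(W, B, A, k=1, candidate=new_col)
after = effective_weight(W, B, A)

max_diff = max(abs(x - y) for rb, ra in zip(before, after) for x, y in zip(rb, ra))
print("max change in effective weight:", max_diff)
```

Because only one column of `B` changes per call, any optimizer statistics kept for the other columns would remain valid under this scheme; how the switched dimension's optimizer state is handled is left out here, since the abstract only says the impact on optimizer states is minimized.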

Related papers

LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning [5.980897761790243]
We introduce LoFT, a novel low-rank adaptation method that behaves like full fine-tuning. LoFT aligns the model's internal dynamics with those of updating all model weights. Empirically, this approach substantially narrows the performance gap between adapter-based tuning and full fine-tuning.
arXiv Detail & Related papers (2025-05-27T14:54:24Z)
Flat-LoRA: Low-Rank Adaption over a Flat Loss Landscape [52.98187034726091]
Low-Rank Adaptation (LoRA) is an efficient way to fine-tune models by optimizing only a low-rank matrix. A solution that appears flat in the LoRA space may still have sharp directions in the full parameter space, potentially harming generalization performance. We propose Flat-LoRA, an efficient approach that seeks a low-rank adaptation located in a flat region of the full parameter space.
arXiv Detail & Related papers (2024-09-22T11:24:10Z)
PC-LoRA: Low-Rank Adaptation for Progressive Model Compression with Knowledge Distillation [9.445321300673909]
Low-rank adaptation (LoRA) is a prominent method that adds a small number of learnable parameters to the frozen pre-trained weights for fine-tuning. In this work, we introduce Progressive Compression LoRA (PC-LoRA), which simultaneously performs model compression and fine-tuning.
arXiv Detail & Related papers (2024-06-13T13:44:31Z)
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and optimizer states. In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
arXiv Detail & Related papers (2024-03-06T07:29:57Z)
MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning [71.50432879573614]
Low-rank adaptation (LoRA) is based on the idea that the adaptation process is intrinsically low-dimensional. We present MELoRA, mini-ensemble low-rank adapters that use fewer trainable parameters while maintaining a higher rank. Our experimental results show that, compared to LoRA, MELoRA achieves better performance with 8 times fewer trainable parameters on natural language understanding tasks and 36 times fewer trainable parameters on instruction following tasks.
arXiv Detail & Related papers (2024-02-27T07:14:12Z)
DoRA: Weight-Decomposed Low-Rank Adaptation [57.68678247436207]
We introduce a novel weight decomposition analysis to investigate the inherent differences between full fine-tuning (FT) and LoRA. Drawing on these findings and aiming to match the learning capacity of FT, we propose Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning.
arXiv Detail & Related papers (2024-02-14T17:59:34Z)
Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models [45.72323731094864]
Low-Rank Adaptation (LoRA) has emerged as a popular parameter-efficient fine-tuning (PEFT) method. In this work, we study the enhancement of LoRA training by introducing an $r \times r$ preconditioner in each gradient step.
arXiv Detail & Related papers (2024-02-04T05:05:43Z)
Run LoRA Run: Faster and Lighter LoRA Implementations [50.347242693025336]
LoRA is a technique that reduces the number of trainable parameters in a neural network by introducing low-rank adapters to linear layers. This paper presents the RunLoRA framework for efficient implementations of LoRA. Experiments show up to 28% speedup on language modeling networks.
arXiv Detail & Related papers (2023-12-06T10:54:34Z)
Sparse Low-rank Adaptation of Pre-trained Language Models [79.74094517030035]
We introduce sparse low-rank adaptation (SoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process. Our approach strengthens the representation power of LoRA by initializing it with a higher rank, while efficiently taming a temporarily increased number of parameters. Our experimental results demonstrate that SoRA can outperform other baselines even with 70% retained parameters and 70% training time.
arXiv Detail & Related papers (2023-11-20T11:56:25Z)
ReLoRA: High-Rank Training Through Low-Rank Updates [14.606961537327345]
We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. ReLoRA saves up to 5.5Gb of RAM per GPU and improves training speed by 9-40% depending on the model size and hardware setup.
arXiv Detail & Related papers (2023-07-11T18:02:09Z)
LoRA: Low-Rank Adaptation of Large Language Models [71.75808607987281]
Low-Rank Adaptation, or LoRA, freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture. For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times compared to full fine-tuning.
arXiv Detail & Related papers (2021-06-17T17:37:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.