Sparse High Rank Adapters
- URL: http://arxiv.org/abs/2406.13175v1
- Date: Wed, 19 Jun 2024 03:13:11 GMT
- Title: Sparse High Rank Adapters
- Authors: Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Rafael Esteves, Shreya Kadambi, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen, Harris Teague, Markus Nagel
- Abstract summary: Low Rank Adaptation (LoRA) has gained massive attention in recent generative AI research.
We propose Sparse High Rank Adapters (SHiRA), a new paradigm which incurs no inference overhead, enables rapid switching, and significantly reduces concept-loss.
- Score: 16.160749645651567
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Low Rank Adaptation (LoRA) has gained massive attention in recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models, adding no overhead during inference. However, from a mobile deployment standpoint, we can either avoid inference overhead in the fused mode but lose the ability to switch adapters rapidly, or suffer significant (up to 30% higher) inference latency while enabling rapid switching in the unfused mode. LoRA also exhibits concept-loss when multiple adapters are used concurrently. In this paper, we propose Sparse High Rank Adapters (SHiRA), a new paradigm which incurs no inference overhead, enables rapid switching, and significantly reduces concept-loss. Specifically, SHiRA can be trained by directly tuning only 1-2% of the base model weights while leaving others unchanged. This results in a highly sparse adapter which can be switched directly in the fused mode. We further provide theoretical and empirical insights on how high sparsity in SHiRA can aid multi-adapter fusion by reducing concept loss. Our extensive experiments on LVMs and LLMs demonstrate that finetuning only a small fraction of the parameters in the base model is sufficient for many tasks while enabling both rapid switching and multi-adapter fusion. Finally, we provide a latency- and memory-efficient SHiRA implementation based on the Parameter-Efficient Finetuning (PEFT) library. This implementation trains at nearly the same speed as LoRA while consuming lower peak GPU memory, thus making SHiRA easy to adopt for practical use cases.
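The core mechanism is simple enough to sketch: fix a highly sparse mask over the base weights, let gradients flow only through the masked entries, and the trained sparse delta itself becomes the adapter. Below is a minimal PyTorch sketch of that idea; the magnitude-based mask choice and all helper names are illustrative assumptions, not the authors' released PEFT-based implementation (the paper studies several mask-selection strategies).

```python
import torch
import torch.nn as nn

def attach_shira_mask(linear: nn.Linear, sparsity: float = 0.02) -> torch.Tensor:
    """Freeze all but ~`sparsity` of the weights by zeroing gradients outside
    a fixed mask (top-magnitude selection here, purely for illustration)."""
    k = max(1, int(sparsity * linear.weight.numel()))
    idx = linear.weight.detach().abs().flatten().topk(k).indices
    mask = torch.zeros(linear.weight.numel(), dtype=torch.bool)
    mask[idx] = True
    mask = mask.view_as(linear.weight)
    linear.weight.register_hook(lambda g: g * mask)  # gradient masking
    return mask

layer = nn.Linear(512, 512)
base = layer.weight.detach().clone()
mask = attach_shira_mask(layer)

layer(torch.randn(4, 512)).sum().backward()  # only masked entries receive gradient
# After training, the adapter is just the sparse delta on the masked entries;
# fusing or switching it is a cheap scatter-add into the base weight.
adapter = ((layer.weight.detach() - base) * mask).to_sparse()
```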
Related papers
- ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition [3.0212805640687836]
We propose ID-LoRA, a novel PEFT framework that breaks the trade-off. We evaluate ID-LoRA on five diverse benchmarks: Mathematical Reasoning, Code Generation, MMLU, CommonsenseQA, and Safety Alignment.
arXiv Detail & Related papers (2026-02-24T09:45:10Z)
- FLoRA: Fused forward-backward adapters for parameter efficient fine-tuning and reducing inference-time latencies of LLMs [7.771813594229729]
We propose a family of fused forward-backward adapters (FFBA) for parameter-efficient fine-tuning of large language models (LLMs) on downstream tasks. Experimental results show that the proposed FFB adapters perform significantly better than the widely used LoRA in both accuracy and latency.
arXiv Detail & Related papers (2025-10-28T12:45:45Z)
- Faster Than SVD, Smarter Than SGD: The OPLoRA Alternating Update [50.36542772932594]
Low-Rank Adaptation (LoRA) fine-tunes large models by learning low-rank updates on top of frozen weights. There is still a gap between full training with low-rank projections (SVDLoRA) and LoRA fine-tuning, indicating that LoRA steps can be further improved.
arXiv Detail & Related papers (2025-09-24T10:32:50Z)
- SRLoRA: Subspace Recomposition in Low-Rank Adaptation via Importance-Based Fusion and Reinitialization [2.594346658179846]
Low-Rank Adaptation (LoRA) constrains updates to a fixed low-rank subspace. We introduce Subspace Recomposition in Low-Rank Adaptation (SRLoRA) via importance-based fusion and reinitialization. SRLoRA consistently achieves faster convergence and improved accuracy over standard LoRA.
arXiv Detail & Related papers (2025-05-18T14:12:40Z)
- Activated LoRA: Fine-tuned LLMs for Intrinsics [9.503174205896533]
Low-Rank Adaptation (LoRA) has emerged as a highly efficient framework for finetuning the weights of large foundation models.
We propose Activated LoRA (aLoRA), which modifies the LoRA framework to only adapt weights for the tokens in the sequence *after* the aLoRA is invoked.
This change crucially allows aLoRA to accept the base model's KV cache of the input string, meaning that aLoRA can be instantly activated whenever needed in a chain.
arXiv Detail & Related papers (2025-04-16T18:03:21Z)
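The gating described above lends itself to a small sketch: apply the low-rank delta only to positions at or after the invocation index, so the prefix's hidden states (and hence its KV cache) match the base model exactly. Class and argument names below are mine, not the paper's API.

```python
import torch
import torch.nn as nn

class ActivatedLoRALinear(nn.Module):
    """Toy aLoRA-style layer: LoRA delta is masked out for all tokens before
    the invocation position, leaving the prompt prefix purely base-model."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor, invoke_pos: int) -> torch.Tensor:
        # x: (batch, seq_len, in_features)
        y = self.base(x)
        delta = (x @ self.A.T) @ self.B.T
        active = torch.arange(x.shape[1], device=x.device) >= invoke_pos
        return y + delta * active.view(1, -1, 1)

layer = ActivatedLoRALinear(nn.Linear(64, 64), rank=8)
out = layer(torch.randn(2, 10, 64), invoke_pos=6)  # tokens 0-5 see the base model only
```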
- Replay-Free Continual Low-Rank Adaptation with Dynamic Memory [62.85596937435928]
We revisit continual learning, which enables pre-trained vision transformers (ViTs) to sequentially fine-tune on new downstream tasks over time. Recent studies highlight a crossover between CL techniques and parameter-efficient fine-tuning. We propose a novel PEFT-CL method called Dual Low-Rank Adaptation (DualLoRA).
arXiv Detail & Related papers (2024-11-01T14:28:39Z)
- LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization [78.93425154518705]
Low-rank adaptation (LoRA) is a widely used parameter-efficient finetuning method for LLMs that reduces memory requirements.
This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization.
arXiv Detail & Related papers (2024-10-27T22:57:12Z)
- SLIM: Let LLM Learn More and Forget Less with Soft LoRA and Identity Mixture [7.543093479330315]
Training the whole model for downstream tasks is expensive, and could easily result in catastrophic forgetting.
We propose a novel mixture-of-experts (MoE) framework based on Soft LoRA and Identity Mixture (SLIM). SLIM allows dynamic routing between LoRA adapters and skip connections, enabling the suppression of forgetting.
arXiv Detail & Related papers (2024-10-10T09:16:05Z)
- Rapid Switching and Multi-Adapter Fusion via Sparse High Rank Adapters [16.160749645651567]
We propose Sparse High Rank Adapters (SHiRA) that directly finetune 1-2% of the base model weights while leaving others unchanged.
This high sparsity incurs no inference overhead, enables rapid switching directly in the fused mode, and significantly reduces concept-loss during multi-adapter fusion.
arXiv Detail & Related papers (2024-07-22T22:46:36Z)
- SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information [3.6859322366469933]
SwitchLoRA is a parameter-efficient training technique that frequently and smoothly replaces the trainable parameters of LoRA with alternative parameters.
Our results demonstrate that SwitchLoRA actually surpasses full-rank training.
arXiv Detail & Related papers (2024-06-03T05:40:34Z)
- ResLoRA: Identity Residual Mapping in Low-Rank Adaption [96.59370314485074]
We propose ResLoRA, an improved framework of low-rank adaptation (LoRA).
Our method can achieve better results in fewer training steps without any extra trainable parameters or inference cost compared to LoRA.
The experiments on NLG, NLU, and text-to-image tasks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2024-02-28T04:33:20Z)
- MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning [71.50432879573614]
Low-rank adaptation (LoRA) is based on the idea that the adaptation process is intrinsically low-dimensional.
We present MELoRA, a mini-ensemble of low-rank adapters that uses fewer trainable parameters while maintaining a higher rank.
Our experimental results show that, compared to LoRA, MELoRA achieves better performance with 8 times fewer trainable parameters on natural language understanding tasks and 36 times fewer trainable parameters on instruction following tasks.
arXiv Detail & Related papers (2024-02-27T07:14:12Z)
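A compact way to see the MELoRA construction above: n tiny LoRAs, each owning a disjoint slice of the features, form a block-diagonal update whose rank can reach n times the per-adapter rank. A sketch under those assumptions (class name mine):

```python
import torch
import torch.nn as nn

class MiniEnsembleLoRA(nn.Module):
    """n tiny LoRAs, each on a disjoint slice of the feature dimensions.
    The implied update is block-diagonal, so its rank can reach n * rank
    while the parameter count stays that of n small adapters."""
    def __init__(self, dim: int, n: int = 4, rank: int = 2):
        super().__init__()
        assert dim % n == 0
        self.n, self.slice = n, dim // n
        self.A = nn.Parameter(torch.randn(n, self.slice, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n, rank, self.slice))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) -> (batch, n, slice); each group gets its own mini LoRA
        xg = x.view(x.shape[0], self.n, self.slice)
        delta = torch.einsum("bns,nsr,nrt->bnt", xg, self.A, self.B)
        return delta.reshape(x.shape[0], -1)

adapter = MiniEnsembleLoRA(dim=64, n=4, rank=2)
update = adapter(torch.randn(8, 64))  # added to the frozen layer's output
```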
- LoRA+: Efficient Low Rank Adaptation of Large Models [13.074320303580361]
We show that Low Rank Adaptation (LoRA) leads to suboptimal finetuning of models with large width (embedding dimension).
We then show that this suboptimality of LoRA can be corrected simply by setting different learning rates for the LoRA adapter matrices A and B with a well-chosen ratio.
In our experiments, LoRA+ improves performance (1-2% improvements) and finetuning speed (up to ~2X speedup) at the same computational cost as LoRA.
arXiv Detail & Related papers (2024-02-19T18:33:49Z)
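The LoRA+ recipe amounts to two optimizer parameter groups; a minimal sketch with illustrative values (the learning-rate ratio is a tuning knob in the paper, and the factor shapes here are stand-ins):

```python
import torch

# Stand-in adapter factors; in practice, collect the A/B matrices from the LoRA layers.
A = torch.nn.Parameter(torch.randn(8, 512) * 0.01)
B = torch.nn.Parameter(torch.zeros(512, 8))

lr, ratio = 2e-4, 16  # ratio >> 1 is the point; 16 is an illustrative choice
optimizer = torch.optim.AdamW([
    {"params": [A], "lr": lr},
    {"params": [B], "lr": lr * ratio},  # B learns faster than A
])
```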
- DoRA: Weight-Decomposed Low-Rank Adaptation [57.68678247436207]
We introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA.
Aiming to match the learning capacity of FT based on these findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA).
DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning.
arXiv Detail & Related papers (2024-02-14T17:59:34Z)
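The decomposition reads as W = m * V / ||V||_c, with LoRA updating the direction V while the per-column magnitudes m are trained directly. A self-contained sketch (module name mine, not the paper's code):

```python
import torch
import torch.nn as nn

class DoRALinearSketch(nn.Module):
    """Magnitude/direction split of a frozen weight, with LoRA on the direction."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        W = base.weight.detach()                     # (out, in), kept frozen
        self.register_buffer("W", W)
        self.m = nn.Parameter(W.norm(dim=0))         # one magnitude per column
        self.A = nn.Parameter(torch.randn(rank, W.shape[1]) * 0.01)
        self.B = nn.Parameter(torch.zeros(W.shape[0], rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        V = self.W + self.B @ self.A                 # LoRA updates the direction
        W_adapted = V * (self.m / V.norm(dim=0)).unsqueeze(0)  # renormalize columns
        return x @ W_adapted.T

layer = DoRALinearSketch(nn.Linear(64, 64))
out = layer(torch.randn(4, 64))                      # trains m, A, B only
```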
- Run LoRA Run: Faster and Lighter LoRA Implementations [50.347242693025336]
LoRA is a technique that reduces the number of trainable parameters in a neural network by introducing low-rank adapters to linear layers.
This paper presents the RunLoRA framework for efficient implementations of LoRA.
Experiments show up to 28% speedup on language modeling networks.
arXiv Detail & Related papers (2023-12-06T10:54:34Z)
- A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA [0.7252027234425334]
A popular PEFT method is Low-Rank Adapters (LoRA), which adds trainable low-rank "adapters" to selected layers.
LoRA's conventional scaling factor, which divides adapters by a factor of the rank, results in slowed learning and stunted performance for higher-rank adapters. We modify LoRA with a rank-stabilized scaling factor, which easily provides a fine-tuning compute/performance trade-off.
arXiv Detail & Related papers (2023-11-28T03:23:20Z)
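Concretely, the change is one constant: classic LoRA scales the update BA by alpha/r, while the rank-stabilized variant divides by sqrt(r) so the update's magnitude does not collapse as the rank grows. A sketch:

```python
import math

def lora_scale(alpha: float, rank: int, rank_stabilized: bool = True) -> float:
    """Classic LoRA scales B @ A by alpha / rank, which shrinks the effective
    update as rank grows and stalls learning at higher ranks. The
    rank-stabilized factor divides by sqrt(rank) instead, so additional rank
    buys capacity rather than losing it."""
    return alpha / math.sqrt(rank) if rank_stabilized else alpha / rank

# The adapted forward pass would then be:
#   y = x @ W.T + lora_scale(alpha, r) * (x @ A.T) @ B.T
for r in (8, 64, 512):
    print(r, lora_scale(16, r, rank_stabilized=False), lora_scale(16, r))
```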
- Sparse Low-rank Adaptation of Pre-trained Language Models [79.74094517030035]
We introduce sparse low-rank adaptation (SoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process.
Our approach strengthens the representation power of LoRA by initializing it with a higher rank, while efficiently taming a temporarily increased number of parameters.
Our experimental results demonstrate that SoRA can outperform other baselines even with 70% retained parameters and 70% training time.
arXiv Detail & Related papers (2023-11-20T11:56:25Z)
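The mechanism can be sketched as a learnable gate vector between the LoRA factors, sparsified during training so ranks are pruned on the fly. The paper uses a proximal-gradient update on the gate; the plain L1 penalty below is a simplification, and the names are mine:

```python
import torch
import torch.nn as nn

class SoRALinearSketch(nn.Module):
    """Start from a generous rank and gate each rank component; zeroed gate
    entries effectively prune ranks during adaptation."""
    def __init__(self, base: nn.Linear, max_rank: int = 32):
        super().__init__()
        self.base = base
        self.A = nn.Parameter(torch.randn(max_rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, max_rank))
        self.gate = nn.Parameter(torch.ones(max_rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + ((x @ self.A.T) * self.gate) @ self.B.T

layer = SoRALinearSketch(nn.Linear(64, 64), max_rank=32)
# L1 on the gate stands in for the paper's proximal-gradient sparsification.
loss = layer(torch.randn(4, 64)).sum() + 1e-3 * layer.gate.abs().sum()
```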
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters [59.490751234925206]
Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks.
We present S-LoRA, a system designed for the scalable serving of many LoRA adapters.
arXiv Detail & Related papers (2023-11-06T17:26:17Z)