SiRA: Sparse Mixture of Low Rank Adaptation
- URL: http://arxiv.org/abs/2311.09179v1
- Date: Wed, 15 Nov 2023 18:15:37 GMT
- Title: SiRA: Sparse Mixture of Low Rank Adaptation
- Authors: Yun Zhu, Nevan Wichers, Chu-Cheng Lin, Xinyi Wang, Tianlong Chen, Lei
Shu, Han Lu, Canoee Liu, Liangchen Luo, Jindong Chen, Lei Meng
- Abstract summary: We investigate the importance of leveraging "sparse" computation and propose SiRA: sparse mixture of low rank.
Specifically it enforces the top $k$ experts routing with a capacity limit restricting the maximum number of tokens each expert can process.
- Score: 63.926732717719354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Parameter Efficient Tuning has been an prominent approach to adapt the Large
Language Model to downstream tasks. Most previous works considers adding the
dense trainable parameters, where all parameters are used to adapt certain
task. We found this less effective empirically using the example of LoRA that
introducing more trainable parameters does not help. Motivated by this we
investigate the importance of leveraging "sparse" computation and propose SiRA:
sparse mixture of low rank adaption. SiRA leverages the Sparse Mixture of
Expert(SMoE) to boost the performance of LoRA. Specifically it enforces the top
$k$ experts routing with a capacity limit restricting the maximum number of
tokens each expert can process. We propose a novel and simple expert dropout on
top of gating network to reduce the over-fitting issue. Through extensive
experiments, we verify SiRA performs better than LoRA and other mixture of
expert approaches across different single tasks and multitask settings.
Related papers
- MALoRA: Mixture of Asymmetric Low-Rank Adaptation for Enhanced Multi-Task Learning [29.957620178740186]
In multi-task scenarios, challenges such as training imbalance and the seesaw effect frequently emerge.
We propose Mixture of Asymmetric Low-Rank Adaptaion (MALoRA) as a flexible fine-tuning framework.
MALoRA reduces the number of trainable parameters by 30% to 48%, increases training speed by 1.2x, and matches the computational efficiency of single-task LoRA models.
arXiv Detail & Related papers (2024-10-30T07:53:52Z) - Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs [75.11449420928139]
Fine-tuning Large Language Models (LLMs) has become a crucial technique for adapting pre-trained models to downstream tasks.
Low-Rank Adaptation (LoRA) has emerged as a promising solution, but there exists a gap between the practical performance of low-rank adaptations and its theoretical optimum.
We propose eXtreme Gradient Boosting LoRA, a novel framework that bridges this gap by leveraging the power of ensemble learning.
arXiv Detail & Related papers (2024-10-25T17:07:13Z) - MiLoRA: Efficient Mixture of Low-Rank Adaptation for Large Language Models Fine-tuning [9.91790333647256]
Low-rank adaptation (LoRA) and its mixture-of-experts (MOE) variants are highly effective parameter-efficient fine-tuning (PEFT) methods.
We propose Mixture of Low-Rank Adaptation (MiLoRA), a novel and efficient LoRA variant.
MiLoRA differs from previous MOE-style LoRA methods by considering each LoRA module as an expert and employing a prompt-aware routing mechanism.
arXiv Detail & Related papers (2024-10-23T17:04:40Z) - Randomized Asymmetric Chain of LoRA: The First Meaningful Theoretical Framework for Low-Rank Adaptation [58.288682735160585]
Low-Rank Adaptation (LoRA) is a popular technique for finetuning models.
LoRA often under performs when compared to full- parameter fine-tuning.
We present a framework that rigorously analyzes the adaptation rates of LoRA methods.
arXiv Detail & Related papers (2024-10-10T18:51:53Z) - DoRA: Enhancing Parameter-Efficient Fine-Tuning with Dynamic Rank Distribution [28.589498108609202]
Low-Rank Adaptation (LoRA) relies on a bypass framework that ignores the differential parameter budget requirements across weight matrices.
DoRA decomposes high-rank LoRA layers into structured single-rank components, allowing for dynamic pruning of parameter budget.
Experimental results demonstrate that DoRA can achieve competitive performance compared with LoRA and full model fine-tuning.
arXiv Detail & Related papers (2024-05-27T17:02:27Z) - ResLoRA: Identity Residual Mapping in Low-Rank Adaption [96.59370314485074]
We propose ResLoRA, an improved framework of low-rank adaptation (LoRA)
Our method can achieve better results in fewer training steps without any extra trainable parameters or inference cost compared to LoRA.
The experiments on NLG, NLU, and text-to-image tasks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2024-02-28T04:33:20Z) - MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning [71.50432879573614]
Low-rank adaptation (LoRA) is based on the idea that the adaptation process is intrinsically low-dimensional.
We present MELoRA, a mini-ensemble low-rank adapters that uses fewer trainable parameters while maintaining a higher rank.
Our experimental results show that, compared to LoRA, MELoRA achieves better performance with 8 times fewer trainable parameters on natural language understanding tasks and 36 times fewer trainable parameters on instruction following tasks.
arXiv Detail & Related papers (2024-02-27T07:14:12Z) - DoRA: Weight-Decomposed Low-Rank Adaptation [57.68678247436207]
We introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA.
Aiming to resemble the learning capacity of FT from the findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA)
DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning.
arXiv Detail & Related papers (2024-02-14T17:59:34Z) - MultiLoRA: Democratizing LoRA for Better Multi-Task Learning [20.750808913757396]
LoRA achieves remarkable resource efficiency and comparable performance when adapting LLMs for specific tasks.
LoRA is dominated by a small number of top singular vectors while fine-tuning decomposes into a set of less important unitary transforms.
We propose MultiLoRA for better multi-task adaptation by reducing the dominance of top singular vectors observed in LoRA.
arXiv Detail & Related papers (2023-11-20T02:59:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.