MoELoRA: Contrastive Learning Guided Mixture of Experts on
Parameter-Efficient Fine-Tuning for Large Language Models
- URL: http://arxiv.org/abs/2402.12851v1
- Date: Tue, 20 Feb 2024 09:30:48 GMT
- Title: MoELoRA: Contrastive Learning Guided Mixture of Experts on
Parameter-Efficient Fine-Tuning for Large Language Models
- Authors: Tongxu Luo, Jiahe Lei, Fangyu Lei, Weihao Liu, Shizhu He, Jun Zhao and
Kang Liu
- Abstract summary: We introduce a novel PEFT method: MoELoRA.
We conduct experiments on 11 tasks in math reasoning and common-sense reasoning benchmarks.
MoELoRA achieved an average performance that was 4.2% higher than LoRA, and demonstrated competitive performance compared to the 175B GPT-3.5 on several benchmarks.
- Score: 24.17147521556083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning is often necessary to enhance the adaptability of Large Language
Models (LLM) to downstream tasks. Nonetheless, the process of updating billions
of parameters demands significant computational resources and training time,
which poses a substantial obstacle to the widespread application of large-scale
models in various scenarios. To address this issue, Parameter-Efficient
Fine-Tuning (PEFT) has emerged as a prominent paradigm in recent research.
However, current PEFT approaches that employ a limited set of global parameters
(such as LoRA, which adds low-rank approximation matrices to all weights) face
challenges in flexibly combining different computational modules in downstream
tasks. In this work, we introduce a novel PEFT method: MoELoRA. We consider
LoRA as a Mixture of Experts (MoE), and to mitigate the random routing phenomenon observed in MoE, we propose using contrastive learning to
encourage experts to learn distinct features. We conducted experiments on 11
tasks in math reasoning and common-sense reasoning benchmarks. With the same
number of parameters, our approach outperforms LoRA significantly. In math
reasoning, MoELoRA achieved an average performance that was 4.2% higher than
LoRA, and demonstrated competitive performance compared to the 175B GPT-3.5 on
several benchmarks.
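The abstract describes the mechanism only at a high level, so the following is a minimal PyTorch sketch of the general idea: several LoRA expert pairs (A_i, B_i) are attached to a frozen linear layer, a router mixes their outputs, and an auxiliary contrastive term pushes the experts' outputs apart so they learn distinct features. The soft router, the InfoNCE-style loss, and all names and hyperparameters below are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a MoE-style LoRA layer with a contrastive auxiliary
# loss that discourages experts from collapsing onto the same features.
# All names, the soft router, and the loss form are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, num_experts: int = 4, rank: int = 8,
                 alpha: float = 16.0, temperature: float = 0.1):
        super().__init__()
        self.base = base                                  # frozen pre-trained layer
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        # One low-rank pair (A_i, B_i) per expert; B starts at zero so the
        # adapted layer initially behaves exactly like the frozen base layer.
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        self.router = nn.Linear(d_in, num_experts)        # token-wise soft router
        self.scaling = alpha / rank
        self.temperature = temperature
        self.aux_loss = torch.tensor(0.0)

    def forward(self, x):                                 # x: (batch, seq, d_in)
        gates = F.softmax(self.router(x), dim=-1)                    # (b, s, E)
        expert_out = torch.einsum("bsd,edr,ero->bseo", x, self.A, self.B)
        delta = (gates.unsqueeze(-1) * expert_out).sum(dim=2) * self.scaling
        # Contrastive term (one possible choice): make each expert's mean output
        # direction most similar to itself and dissimilar to the other experts.
        z = F.normalize(expert_out.mean(dim=(0, 1)), dim=-1)         # (E, d_out)
        logits = z @ z.t() / self.temperature
        labels = torch.arange(z.size(0), device=z.device)
        self.aux_loss = F.cross_entropy(logits, labels)   # low when experts differ
        return self.base(x) + delta
```

During training, a small multiple of `aux_loss` (summed over all adapted layers) would be added to the task loss; that weight, like everything else above, is an assumed hyperparameter.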
Related papers
- MALoRA: Mixture of Asymmetric Low-Rank Adaptation for Enhanced Multi-Task Learning [29.957620178740186]
In multi-task scenarios, challenges such as training imbalance and the seesaw effect frequently emerge.
We propose Mixture of Asymmetric Low-Rank Adaptation (MALoRA) as a flexible fine-tuning framework.
MALoRA reduces the number of trainable parameters by 30% to 48%, increases training speed by 1.2x, and matches the computational efficiency of single-task LoRA models.
arXiv Detail & Related papers (2024-10-30T07:53:52Z)
- Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs [75.11449420928139]
Fine-tuning Large Language Models (LLMs) has become a crucial technique for adapting pre-trained models to downstream tasks.
Low-Rank Adaptation (LoRA) has emerged as a promising solution, but a gap remains between the practical performance of low-rank adaptation and its theoretical optimum.
We propose eXtreme Gradient Boosting LoRA, a novel framework that bridges this gap by leveraging the power of ensemble learning.
arXiv Detail & Related papers (2024-10-25T17:07:13Z)
- MTL-LoRA: Low-Rank Adaptation for Multi-Task Learning [74.43869839954168]
We propose MTL-LoRA, which retains the advantages of low-rank adaptation while significantly enhancing multi-task learning capabilities.
MTL-LoRA augments LoRA by incorporating additional task-adaptive parameters that differentiate task-specific information.
This approach enables large language models (LLMs) pre-trained on a general corpus to adapt to different target task domains with a limited number of trainable parameters.
arXiv Detail & Related papers (2024-10-12T08:32:26Z)
- Randomized Asymmetric Chain of LoRA: The First Meaningful Theoretical Framework for Low-Rank Adaptation [58.288682735160585]
Low-Rank Adaptation (LoRA) is a popular technique for finetuning models.
LoRA often underperforms compared to full-parameter fine-tuning.
We present a framework that rigorously analyzes the adaptation rates of LoRA methods.
arXiv Detail & Related papers (2024-10-10T18:51:53Z)
- MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts [3.6301530893494127]
MixLoRA is an approach to construct a resource-efficient sparse MoE model based on LoRA.
Our evaluations show that MixLoRA improves about 9% accuracy compared to state-of-the-art PEFT methods in multi-task learning scenarios.
arXiv Detail & Related papers (2024-04-22T02:15:52Z)
- Mixture of LoRA Experts [87.50120181861362]
This paper introduces the Mixture of LoRA Experts (MoLE) approach, which harnesses hierarchical control and unfettered branch selection.
The MoLE approach achieves superior LoRA fusion performance in comparison to direct arithmetic merging.
arXiv Detail & Related papers (2024-04-21T11:59:53Z)
- LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning [31.088229461632206]
The massive memory consumption of large language models (LLMs) has become a significant roadblock to large-scale training.
Low-Rank Adaptation (LoRA) has been proposed to alleviate this problem.
We investigate the layerwise properties of LoRA on fine-tuning tasks and observe an unexpected but consistent skewness of weight norms.
Based on this observation, we propose Layerwise Importance Sampled AdamW (LISA), a promising alternative to LoRA.
arXiv Detail & Related papers (2024-03-26T17:55:02Z)
- MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning [71.50432879573614]
Low-rank adaptation (LoRA) is based on the idea that the adaptation process is intrinsically low-dimensional.
We present MELoRA, a mini-ensemble of low-rank adapters that uses fewer trainable parameters while maintaining a higher rank (see the mini-ensemble sketch after this list).
Our experimental results show that, compared to LoRA, MELoRA achieves better performance with 8 times fewer trainable parameters on natural language understanding tasks and 36 times fewer trainable parameters on instruction following tasks.
arXiv Detail & Related papers (2024-02-27T07:14:12Z)
- Higher Layers Need More LoRA Experts [23.72297945365351]
We introduce a novel parameter-efficient MoE method, MoE-LoRA with Layer-wise Expert Allocation (MoLA), for Transformer-based models.
Experiments on six well-known NLP and commonsense QA benchmarks demonstrate that MoLA achieves equal or superior performance compared to all baselines.
arXiv Detail & Related papers (2024-02-13T16:04:21Z)
- Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying [6.172790376076545]
We introduce Tied-LoRA, a novel paradigm leveraging weight tying and selective training to enhance the parameter efficiency of Low-Rank Adaptation (LoRA).
Our exploration encompasses different plausible combinations of parameter training and freezing, coupled with weight tying, aimed at identifying the optimal trade-off between performance and the count of trainable parameters.
arXiv Detail & Related papers (2023-11-16T05:29:39Z)
- LoRA: Low-Rank Adaptation of Large Language Models [71.75808607987281]
Low-Rank Adaptation, or LoRA, freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture.
For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times compared to full fine-tuning.
arXiv Detail & Related papers (2021-06-17T17:37:18Z)
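The last entry above is the base technique every other paper in this list builds on; here is a minimal sketch of that mechanism, assuming the common initialization (random A, zero B) and alpha/rank scaling. The class and parameter names are illustrative, not taken from the original implementation.

```python
# Minimal LoRA linear layer: the pre-trained weight stays frozen and a
# trainable low-rank product B @ A is added in parallel.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base                       # frozen pre-trained projection
        for p in self.base.parameters():
            p.requires_grad_(False)
        # A is small random, B is zero, so the adapted layer starts out
        # identical to the frozen base layer (B @ A == 0).
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank            # conventional LoRA scaling

    def forward(self, x):                      # x: (..., in_features)
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scaling
```

Only `A` and `B` (rank * (in_features + out_features) values per adapted layer) are trained, which is where the quoted parameter reduction comes from; after training, `B @ A` can be merged into the frozen weight so inference pays no extra cost.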
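For the MELoRA entry above, the following sketch illustrates one way to read "mini-ensemble ... fewer trainable parameters while maintaining a higher rank": n small LoRA pairs arranged block-diagonally, each adapting one slice of the input and output dimensions, so the combined update can reach rank n_minis * mini_rank at the parameter cost of a single rank-`mini_rank` LoRA. The slicing scheme and all names are assumptions for illustration, not the paper's exact construction.

```python
# Hypothetical sketch of a mini-ensemble (block-diagonal) low-rank adapter.
import torch
import torch.nn as nn

class MiniEnsembleLoRA(nn.Module):
    def __init__(self, base: nn.Linear, n_minis: int = 4, mini_rank: int = 2):
        super().__init__()
        assert base.in_features % n_minis == 0 and base.out_features % n_minis == 0
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.n = n_minis
        d_in = base.in_features // n_minis
        d_out = base.out_features // n_minis
        # n small LoRA pairs; parameter count matches a single rank-`mini_rank`
        # LoRA, but the block-diagonal update has rank up to n_minis * mini_rank.
        self.A = nn.Parameter(torch.randn(n_minis, d_in, mini_rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_minis, mini_rank, d_out))

    def forward(self, x):                            # x: (..., in_features)
        xs = x.reshape(*x.shape[:-1], self.n, -1)    # split input into n slices
        delta = torch.einsum("...nd,ndr,nro->...no", xs, self.A, self.B)
        return self.base(x) + delta.reshape(*x.shape[:-1], -1)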
This list is automatically generated from the titles and abstracts of the papers on this site.