Higher Layers Need More LoRA Experts
- URL: http://arxiv.org/abs/2402.08562v1
- Date: Tue, 13 Feb 2024 16:04:21 GMT
- Title: Higher Layers Need More LoRA Experts
- Authors: Chongyang Gao and Kezhen Chen and Jinmeng Rao and Baochen Sun and
Ruibo Liu and Daiyi Peng and Yawen Zhang and Xiaoyuan Guo and Jie Yang and VS
Subrahmanian
- Abstract summary: We introduce a novel parameter-efficient MoE method, MoE-LoRA with Layer-wise Expert Allocation (MoLA) for Transformer-based models.
Experiments on six well-known NLP and commonsense QA benchmarks demonstrate that MoLA achieves equal or superior performance compared to all baselines.
- Score: 23.72297945365351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Parameter-efficient tuning (PEFT) techniques like low-rank adaptation (LoRA)
offer training efficiency on Large Language Models, but their impact on model
performance remains limited. Recent efforts integrate LoRA and
Mixture-of-Experts (MoE) to improve the performance of PEFT methods. Despite
promising results, research on improving the efficiency of LoRA with MoE is
still in its early stages. Recent studies have shown that experts in the MoE
architecture have different strengths and also exhibit some redundancy. Does
this statement also apply to parameter-efficient MoE? In this paper, we
introduce a novel parameter-efficient MoE method,
MoE-LoRA with Layer-wise Expert Allocation (MoLA) for Transformer-based
models, where each model
layer has the flexibility to employ a varying number of LoRA experts. We
investigate several architectures with varying layer-wise expert
configurations. Experiments on six well-known NLP and commonsense QA benchmarks
demonstrate that MoLA achieves equal or superior performance compared to all
baselines. We find that allocating more LoRA experts to higher layers further
enhances the effectiveness of models with a fixed total number of experts.
With far fewer parameters, this allocation strategy outperforms the setting
with the same number of experts in every layer. This work can be widely used as
a plug-and-play parameter-efficient tuning approach for various applications.
The code is available at https://github.com/GCYZSL/MoLA.
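To make the layer-wise allocation idea concrete, here is a minimal PyTorch-style sketch of a LoRA-MoE layer whose number of experts varies per layer. It is a sketch under stated assumptions, not the released MoLA implementation: the class names, the top-2 routing, the rank and alpha values, and the 2-4-6-8 allocation schedule are illustrative choices (see the linked repository for the authors' code).

```python
# Minimal sketch of layer-wise LoRA expert allocation with a per-layer router.
# This is NOT the released MoLA code: class names, the top-k routing scheme,
# the rank/alpha values, and the 2-4-6-8 allocation schedule are illustrative
# assumptions; see https://github.com/GCYZSL/MoLA for the actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """A single low-rank adapter: delta(x) = (alpha / r) * B(A(x))."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.A = nn.Linear(d_in, r, bias=False)
        self.B = nn.Linear(r, d_out, bias=False)
        nn.init.zeros_(self.B.weight)  # adapters start as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.B(self.A(x)) * self.scale

class MoLALinear(nn.Module):
    """Frozen base projection plus a router over a layer-specific set of LoRA experts."""
    def __init__(self, base_linear, num_experts, r=8, top_k=2):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # only the experts and the router are trained
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.experts = nn.ModuleList(
            LoRAExpert(d_in, d_out, r) for _ in range(num_experts))
        self.router = nn.Linear(d_in, num_experts, bias=False)
        self.top_k = min(top_k, num_experts)

    def forward(self, x):  # x: (batch, seq, d_in)
        out = self.base(x)
        gate_logits = self.router(x)
        weights, chosen = torch.topk(gate_logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (chosen[..., slot] == e).unsqueeze(-1).to(x.dtype)
                out = out + mask * weights[..., slot:slot + 1] * expert(x)
        return out

def allocate_experts(num_layers=32, schedule=(2, 4, 6, 8)):
    """Hypothetical 'more experts in higher layers' schedule: the first quarter
    of layers gets 2 experts per layer, the last quarter gets 8."""
    group = max(1, num_layers // len(schedule))
    return [schedule[min(i // group, len(schedule) - 1)] for i in range(num_layers)]
```

With this hypothetical schedule, a 32-layer model is assigned [2]*8 + [4]*8 + [6]*8 + [8]*8 experts from bottom to top, so lower layers carry fewer adapters and higher layers more, while the total expert budget matches a uniform five-experts-per-layer configuration.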
Related papers
- Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs [75.11449420928139]
Fine-tuning Large Language Models (LLMs) has become a crucial technique for adapting pre-trained models to downstream tasks.
Low-Rank Adaptation (LoRA) has emerged as a promising solution, but a gap remains between the practical performance of low-rank adaptation and its theoretical optimum.
We propose eXtreme Gradient Boosting LoRA, a novel framework that bridges this gap by leveraging the power of ensemble learning.
arXiv Detail & Related papers (2024-10-25T17:07:13Z)
- MiLoRA: Efficient Mixture of Low-Rank Adaptation for Large Language Models Fine-tuning [9.91790333647256]
Low-rank adaptation (LoRA) and its mixture-of-experts (MoE) variants are highly effective parameter-efficient fine-tuning (PEFT) methods.
We propose Mixture of Low-Rank Adaptation (MiLoRA), a novel and efficient LoRA variant.
MiLoRA differs from previous MoE-style LoRA methods by considering each LoRA module as an expert and employing a prompt-aware routing mechanism.
arXiv Detail & Related papers (2024-10-23T17:04:40Z)
- AlphaLoRA: Assigning LoRA Experts Based on Layer Training Quality [31.830108790753172]
Low-Rank Adaptation (LoRA) is known to enhance training efficiency in Large Language Models (LLMs).
Recent studies seek to combine LoRA with Mixture-of-Experts (MoE) to boost performance across various tasks.
We introduce AlphaLoRA, a theoretically principled and training-free method for allocating LoRA experts to further reduce redundancy.
arXiv Detail & Related papers (2024-10-14T00:43:02Z)
- BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts [41.83123857437985]
Training MoEs from scratch in a large-scale regime is prohibitively expensive.
We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming.
Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance.
arXiv Detail & Related papers (2024-08-15T17:19:12Z)
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures.
The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models [24.915387910764082]
Expert-Specialized Fine-Tuning, or ESFT, tunes the experts most relevant to downstream tasks while freezing the other experts and modules.
MoE models with finer-grained experts are more advantageous in selecting the combination of experts that are most relevant to downstream tasks.
arXiv Detail & Related papers (2024-07-02T03:11:13Z)
- Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
arXiv Detail & Related papers (2024-05-23T12:45:29Z)
- Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training [73.90260246781435]
We present Lory, the first approach that scales such architectures to autoregressive language model pre-training.
We show significant performance gains over parameter-matched dense models on both perplexity and a variety of downstream tasks.
Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing.
arXiv Detail & Related papers (2024-05-06T03:06:33Z)
- Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer parameters, but it is still hard to deploy them due to their immense parameter sizes.
This paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
arXiv Detail & Related papers (2024-02-22T18:56:07Z)
- MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models [24.17147521556083]
We introduce a novel PEFT method: MoELoRA.
We conduct experiments on 11 tasks in math reasoning and common-sense reasoning benchmarks.
MoELoRA achieved an average performance that was 4.2% higher than LoRA, and demonstrated competitive performance compared to the 175B GPT-3.5 on several benchmarks.
arXiv Detail & Related papers (2024-02-20T09:30:48Z)
- Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts [74.40198929049959]
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks.
However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks.
We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to mix many multimodal low-rank experts (a minimal sketch of this soft-mixing idea appears after this list).
arXiv Detail & Related papers (2023-12-01T23:04:27Z)
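For contrast with MoLA's hard, layer-wise top-k routing, the following sketch illustrates the soft-mixture idea mentioned in the Omni-SMoLA entry above, in which every low-rank expert contributes to every token through a softmax gate. The module name, gate, rank, and expert count are assumptions made for illustration, not the Omni-SMoLA architecture.

```python
# Illustrative soft mixture of low-rank experts (an assumption-based sketch,
# not the Omni-SMoLA implementation): every expert contributes to every token,
# weighted by a softmax gate, instead of hard top-k routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftLoRAMixture(nn.Module):
    def __init__(self, base_linear, num_experts=4, r=8, alpha=16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.A = nn.Parameter(torch.randn(num_experts, d_in, r) * 0.02)
        self.B = nn.Parameter(torch.zeros(num_experts, r, d_out))  # zero-init: no-op at start
        self.gate = nn.Linear(d_in, num_experts, bias=False)
        self.scale = alpha / r

    def forward(self, x):  # x: (batch, seq, d_in)
        w = F.softmax(self.gate(x), dim=-1)  # soft weights over all experts
        # Per-expert low-rank updates, shape (batch, seq, num_experts, d_out).
        delta = torch.einsum("bsi,eir,ero->bseo", x, self.A, self.B) * self.scale
        return self.base(x) + torch.einsum("bse,bseo->bso", w, delta)
```

Compared with the top-k router in the MoLA sketch, the soft gate trades sparsity for a dense but still low-rank update, which keeps the added parameter count small.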