LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts
- URL: http://arxiv.org/abs/2509.25684v1
- Date: Tue, 30 Sep 2025 02:38:10 GMT
- Title: LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts
- Authors: Yuan Zhuang, Yi Shen, Yuexin Bian, Qing Su, Shihao Ji, Yuanyuan Shi, Fei Miao
- Abstract summary: We propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts. Our design allows the model to adaptively determine the number of experts to activate for each token at different layers. Our method not only achieves superior performance but also demonstrates the ability to learn token-dependent and layer-wise expert allocation.
- Score: 24.0422448103907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have shown that combining parameter-efficient fine-tuning (PEFT) with mixture-of-experts (MoE) is an effective strategy for adapting large language models (LLMs) to downstream tasks. However, most existing approaches rely on conventional TopK routing, which requires careful hyperparameter tuning and assigns a fixed number of experts to each token. In this work, we propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts that enables adaptive, token-dependent, and layer-wise expert allocation. Our method replaces the non-differentiable TopK selection with a differentiable routing function that admits a closed-form solution. Moreover, our design allows the model to adaptively determine the number of experts to activate for each token at different layers. In addition, we introduce an analytical sparsity control objective to regularize the number of activated experts. Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores across a diverse set of benchmarks compared to state-of-the-art baselines. Our method not only achieves superior performance but also demonstrates the ability to learn token-dependent and layer-wise expert allocation.
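To make the routing idea concrete, below is a minimal sketch of a LoRA-expert layer with differentiable, token-dependent sparse routing. Sparsemax (a closed-form projection onto the probability simplex) is used here as a stand-in for the paper's closed-form routing function, which the abstract does not spell out; all module and parameter names are illustrative.
```python
import torch
import torch.nn as nn

def sparsemax(z: torch.Tensor) -> torch.Tensor:
    """Closed-form projection of logits onto the simplex; yields exact zeros."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    cumsum = z_sorted.cumsum(dim=-1)
    support = 1 + k * z_sorted > cumsum            # experts in the support set
    k_z = support.sum(dim=-1, keepdim=True)        # support size per token
    tau = (cumsum.gather(-1, k_z - 1) - 1) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)

class DynamicMoLELayer(nn.Module):
    """Base linear layer plus a mixture of LoRA experts with sparse routing."""
    def __init__(self, d_in: int, d_out: int, n_experts: int = 8, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.router = nn.Linear(d_in, n_experts)
        self.A = nn.Parameter(torch.randn(n_experts, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, d_out, rank))

    def forward(self, x):                               # x: (batch, seq, d_in)
        w = sparsemax(self.router(x))                   # sparse routing weights
        h = torch.einsum("erd,bsd->bser", self.A, x)    # per-expert down-proj
        h = torch.einsum("eor,bser->bseo", self.B, h)   # per-expert up-proj
        update = (w.unsqueeze(-1) * h).sum(dim=2)       # weighted expert sum
        n_active = (w > 0).float().sum(dim=-1).mean()   # avg experts per token
        return self.base(x) + update, n_active

layer = DynamicMoLELayer(64, 64)
y, n_active = layer(torch.randn(2, 10, 64))
print(y.shape, n_active.item())
```
Because sparsemax produces exact zeros, each token activates a data-dependent subset of experts, and the returned per-token count is the quantity a sparsity objective like the paper's would regularize.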
Related papers
- Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training [30.589225478300023]
DTop-p is a sparsity-controllable dynamic Top-p routing mechanism. We show that DTop-p consistently outperforms both Top-k and fixed-threshold Top-p baselines. DTop-p exhibits strong scaling properties with respect to expert granularity, expert capacity, model size, and dataset size.
arXiv Detail & Related papers (2025-12-16T01:28:57Z)
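As a hedged illustration of the underlying Top-p rule: each token activates the smallest expert set whose cumulative router probability exceeds p. The feedback loop that adapts p toward a target sparsity below is an assumption for illustration, not DTop-p's actual controller.
```python
import torch

def top_p_route(logits: torch.Tensor, p: float):
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, idx = torch.sort(probs, dim=-1, descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    # Keep each expert whose preceding cumulative mass is below p,
    # so the top-1 expert is always kept.
    keep_sorted = (cum - sorted_probs) < p
    keep = torch.zeros_like(probs).scatter(-1, idx, keep_sorted.to(probs.dtype))
    weights = probs * keep
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, keep.sum(dim=-1)        # routing weights, #active experts

# Toy feedback loop nudging p until the average expert count hits a target.
p, target_k, lr = 0.5, 2.0, 0.01
logits = torch.randn(32, 8)                 # 32 tokens, 8 experts
for _ in range(200):
    _, n_active = top_p_route(logits, p)
    p = min(max(p - lr * (n_active.mean().item() - target_k), 0.05), 0.99)
print(round(p, 3), top_p_route(logits, p)[1].mean().item())
```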
- Hierarchical LoRA MoE for Efficient CTR Model Scaling [56.608809143548946]
HiLoMoE is a hierarchical LoRA MoE framework that enables holistic scaling in a parameter-efficient manner. Unlike conventional stacking, HiLoMoE routes based on prior-layer scores rather than outputs, allowing all layers to execute in parallel.
arXiv Detail & Related papers (2025-10-12T03:54:11Z)
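A sketch of the parallel-execution idea from the abstract: each layer's router consumes the previous layer's routing scores rather than its hidden outputs, so every layer's expert computation depends only on the shared input. Shapes and module names are illustrative assumptions.
```python
import torch
import torch.nn as nn

class ScoreChainedRouters(nn.Module):
    """Routers chained on scores: layer l sees layer l-1's routing weights."""
    def __init__(self, d_model: int, n_experts: int, n_layers: int):
        super().__init__()
        self.routers = nn.ModuleList(
            nn.Linear(d_model + (n_experts if l > 0 else 0), n_experts)
            for l in range(n_layers)
        )

    def forward(self, x):                   # x: (batch, seq, d_model)
        scores, per_layer = None, []
        for l, router in enumerate(self.routers):
            inp = x if l == 0 else torch.cat([x, scores], dim=-1)
            scores = torch.softmax(router(inp), dim=-1)
            per_layer.append(scores)
        # Every layer's expert computation can now be launched from x in
        # parallel and combined with these precomputed per-layer weights.
        return per_layer

weights = ScoreChainedRouters(64, 4, 3)(torch.randn(2, 10, 64))
print([w.shape for w in weights])           # 3 x (2, 10, 4)
```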
- MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models [52.876185634349575]
We propose to incorporate Mixture of Intra- and Inter-Modality Experts (MoIIE) into Large Vision-Language Models (LVLMs). For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts. Our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source MoE-LLM-based multi-modal models.
arXiv Detail & Related papers (2025-08-13T13:00:05Z)
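A minimal sketch of modality-guided routing in the spirit of MoIIE: each token can reach its own modality's intra-modality experts plus a shared inter-modality pool, with the other modality's pool masked out. Pool sizes and the masking scheme are assumptions.
```python
import torch
import torch.nn as nn

class ModalityRouter(nn.Module):
    """Routes each token over its intra-modality pool plus a shared pool."""
    def __init__(self, d_model: int, n_intra: int = 4, n_shared: int = 2):
        super().__init__()
        self.n_intra = n_intra
        # Expert layout: [text intra | image intra | shared inter-modality].
        self.router = nn.Linear(d_model, 2 * n_intra + n_shared)

    def forward(self, x, modality):         # modality: (b, s), 0=text, 1=image
        logits = self.router(x)             # (b, s, 2*n_intra + n_shared)
        mask = torch.zeros_like(logits)
        # Block the *other* modality's intra experts; shared pool stays open.
        other = (1 - modality).unsqueeze(-1)                    # (b, s, 1)
        pos = other * self.n_intra + torch.arange(self.n_intra, device=x.device)
        mask.scatter_(-1, pos, float("-inf"))
        return torch.softmax(logits + mask, dim=-1)

router = ModalityRouter(32)
w = router(torch.randn(2, 6, 32), torch.randint(0, 2, (2, 6)))
print(w.shape)   # (2, 6, 10); other-modality intra experts get zero weight
```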
- ExpertSteer: Intervening in LLMs through Expert Knowledge [86.98098988779809]
Activation steering offers a promising method to control the generation process of Large Language Models. We propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors. We conduct comprehensive experiments using three LLMs on 15 popular benchmarks across four distinct domains.
arXiv Detail & Related papers (2025-05-18T08:55:46Z)
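A toy sketch of activation steering with an externally derived vector. The mean expert-model activation used below is a stand-in for ExpertSteer's actual steering-vector generation, and the injection point and scale are illustrative.
```python
import torch
import torch.nn as nn

d = 64
target_block = nn.Linear(d, d)          # stand-in for one transformer block
expert_hidden = torch.randn(100, d)     # activations from an expert model
steer = expert_hidden.mean(dim=0)       # toy steering vector (hypothetical)
steer = steer / steer.norm()

def steering_hook(module, inputs, output, alpha=4.0):
    # Returning a tensor from a forward hook replaces the block's output.
    return output + alpha * steer

handle = target_block.register_forward_hook(steering_hook)
steered = target_block(torch.randn(2, d))
handle.remove()
print(steered.shape)
```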
- DynMoLE: Boosting Mixture of LoRA Experts Fine-Tuning with a Hybrid Routing Mechanism [5.988126768890861]
DynMoLE is a hybrid routing strategy that dynamically adjusts expert selection based on the Tsallis entropy of the router's probability distribution. Our experiments on commonsense reasoning benchmarks demonstrate that DynMoLE achieves substantial performance improvements.
arXiv Detail & Related papers (2025-04-01T11:14:19Z)
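A hedged sketch of entropy-conditioned selection: the Tsallis entropy of the router distribution sets a per-token expert budget, so confident tokens activate fewer experts. The linear entropy-to-budget mapping is an illustrative assumption, not DynMoLE's exact rule.
```python
import torch

def tsallis_entropy(p: torch.Tensor, q: float = 1.5) -> torch.Tensor:
    return (1.0 - (p ** q).sum(dim=-1)) / (q - 1.0)

def dynamic_select(logits: torch.Tensor, k_min=1, k_max=4, q=1.5):
    p = torch.softmax(logits, dim=-1)               # (tokens, experts)
    h = tsallis_entropy(p, q)
    h_max = tsallis_entropy(torch.full_like(p, 1.0 / p.size(-1)), q)
    # Map normalized entropy in [0, 1] to a per-token expert budget.
    k = (k_min + (h / h_max) * (k_max - k_min)).round().long()
    k = k.clamp(k_min, k_max)
    ranks = p.argsort(dim=-1, descending=True).argsort(dim=-1)
    keep = (ranks < k.unsqueeze(-1)).to(p.dtype)    # top-k_t mask per token
    w = p * keep
    return w / w.sum(dim=-1, keepdim=True), k

w, k = dynamic_select(torch.randn(6, 8))
print(k)   # larger budgets for tokens with more uncertain routers
```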
- Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning [76.10639521319382]
We propose Symbolic-MoE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework. We show Symbolic-MoE beats strong LLMs like GPT4o-mini, as well as multi-agent approaches, with an absolute average gain of 8.15% over the best multi-agent baseline.
arXiv Detail & Related papers (2025-03-07T18:03:13Z)
- OMoE: Diversifying Mixture of Low-Rank Adaptation by Orthogonal Finetuning [3.8813502422318127]
Building a mixture-of-experts (MoE) architecture for low-rank adaptation (LoRA) is emerging as a promising direction in parameter-efficient fine-tuning (PEFT). We first conduct a qualitative analysis indicating that experts collapse to similar representations in vanilla MoE, limiting the capacity of the modular design and computational efficiency. Motivated by these findings, we propose Orthogonal Mixture-of-Experts (OMoE). Our method is simple and alleviates memory bottlenecks, as it activates only a minimal number of experts compared to vanilla MoE models.
arXiv Detail & Related papers (2025-01-17T09:27:08Z)
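One simple way to discourage the expert collapse described above is to penalize pairwise similarity between expert update directions. The Gram-matrix regularizer below is a stand-in for OMoE's orthogonal fine-tuning, whose exact mechanism the abstract does not detail.
```python
import torch

def orthogonality_penalty(expert_mats: torch.Tensor) -> torch.Tensor:
    # expert_mats: (n_experts, d_out, d_in), e.g. low-rank updates B_e @ A_e.
    flat = expert_mats.flatten(1)
    flat = flat / flat.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    gram = flat @ flat.t()                          # pairwise cosine similarity
    off_diag = gram - torch.eye(gram.size(0))
    return (off_diag ** 2).sum()

experts = torch.randn(4, 16, 32, requires_grad=True)
loss = orthogonality_penalty(experts)               # add to the task loss
loss.backward()
print(loss.item())
```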
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
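An illustrative sketch of grouping similar experts and merging each group into a single expert. The greedy cosine-similarity grouping and averaging below are assumptions; the paper's grouping criterion may differ.
```python
import torch

def group_and_merge(experts: torch.Tensor, threshold: float = 0.9):
    # experts: (n, d) flattened expert weights.
    normed = experts / experts.norm(dim=-1, keepdim=True)
    sim = normed @ normed.t()                   # pairwise cosine similarity
    groups, assigned = [], set()
    for i in range(experts.size(0)):
        if i in assigned:
            continue
        members = [j for j in range(experts.size(0))
                   if j not in assigned and sim[i, j] >= threshold]
        assigned.update(members)
        groups.append(members)
    # Merge each group of similar experts by averaging their weights.
    merged = torch.stack([experts[g].mean(dim=0) for g in groups])
    return merged, groups

merged, groups = group_and_merge(torch.randn(8, 128), threshold=0.5)
print(len(groups), merged.shape)                # fewer experts after merging
```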
- AdaMoLE: Fine-Tuning Large Language Models with Adaptive Mixture of Low-Rank Adaptation Experts [0.0]
We introduce AdaMoLE, a novel method for fine-tuning large language models (LLMs) through an Adaptive Mixture of Low-Rank Adaptation Experts.
AdaMoLE dynamically adjusts the activation threshold using a dedicated threshold network, adaptively responding to the varying complexities of different tasks.
arXiv Detail & Related papers (2024-05-01T07:33:43Z)
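A minimal sketch of the thresholding idea in the abstract: a small threshold network emits a per-token activation threshold, and experts whose routing probability exceeds it are activated. The network shapes and the cap at uniform mass are assumptions.
```python
import torch
import torch.nn as nn

class ThresholdRouter(nn.Module):
    """Activates every expert whose probability clears a learned threshold."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.threshold_net = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, x):                           # x: (batch, seq, d_model)
        p = torch.softmax(self.gate(x), dim=-1)
        # Cap the threshold below uniform mass so >= 1 expert always fires.
        tau = self.threshold_net(x) / p.size(-1)    # (batch, seq, 1)
        w = p * (p > tau).to(p.dtype)
        return w / w.sum(dim=-1, keepdim=True)

router = ThresholdRouter(32, 8)
w = router(torch.randn(2, 5, 32))
print((w > 0).sum(dim=-1))                          # experts per token varies
```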
- Harder Tasks Need More Experts: Dynamic Routing in MoE Models [58.18526590138739]
We introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models.
Our method dynamically selects experts based on the confidence level in expert selection for each input.
arXiv Detail & Related papers (2024-03-12T13:41:15Z)
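A hedged sketch of confidence-based expert counts: when the router's top-1 probability is low (a harder input), the token receives more experts. The linear confidence-to-count mapping is illustrative, not the paper's exact rule.
```python
import torch

def confidence_routing(logits: torch.Tensor, k_max: int = 4):
    p = torch.softmax(logits, dim=-1)               # (tokens, experts)
    conf = p.max(dim=-1).values                     # top-1 confidence
    # Low confidence -> larger per-token expert count (harder input).
    k = ((1 - conf) * k_max).ceil().long().clamp(1, k_max)
    top = p.topk(k_max, dim=-1)
    keep = torch.arange(k_max) < k.unsqueeze(-1)    # first k_t of the top-k_max
    w = torch.zeros_like(p).scatter(-1, top.indices,
                                    top.values * keep.to(p.dtype))
    return w / w.sum(dim=-1, keepdim=True), k

w, k = confidence_routing(torch.randn(6, 8))
print(k)   # more experts for tokens where the router is less confident
```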
- Higher Layers Need More LoRA Experts [23.72297945365351]
We introduce a novel parameter-efficient MoE method, MoE-LoRA with Layer-wise Expert Allocation (MoLA), for Transformer-based models.
Experiments on six well-known NLP and commonsense QA benchmarks demonstrate that MoLA achieves equal or superior performance compared to all baselines.
arXiv Detail & Related papers (2024-02-13T16:04:21Z)
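A toy sketch of layer-wise expert allocation in the spirit of MoLA: higher layers receive more LoRA experts than lower ones. The linear allocation schedule and module layout are illustrative assumptions.
```python
import torch.nn as nn

def build_layerwise_moe(n_layers=8, d_model=64, min_experts=2, max_experts=8):
    blocks = []
    for l in range(n_layers):
        # Linearly increase the expert count from bottom to top layers.
        n_exp = min_experts + round(l * (max_experts - min_experts)
                                    / (n_layers - 1))
        blocks.append(nn.ModuleDict({
            "router": nn.Linear(d_model, n_exp),
            "experts": nn.ModuleList(nn.Linear(d_model, d_model)
                                     for _ in range(n_exp)),
        }))
    return nn.ModuleList(blocks)

moe = build_layerwise_moe()
print([len(block["experts"]) for block in moe])   # counts grow with depth
```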