FURINA: Free from Unmergeable Router via LINear Aggregation of mixed experts
- URL: http://arxiv.org/abs/2509.14900v2
- Date: Thu, 25 Sep 2025 11:54:34 GMT
- Title: FURINA: Free from Unmergeable Router via LINear Aggregation of mixed experts
- Authors: Jiayi Han, Liang Du, Yinda Chen, Xiao Kang, Weiyang Ding, Donghong Han,
- Abstract summary: Mixture of Experts (MoE) has been successfully integrated into Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. A key limitation of existing MoE-LoRA methods is their reliance on a discrete router. We propose FURINA, a novel Free from Unmergeable Router framework based on the LINear Aggregation of experts.
- Score: 17.056585698418587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Mixture of Experts (MoE) paradigm has been successfully integrated into Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning (PEFT), delivering performance gains with minimal parameter overhead. However, a key limitation of existing MoE-LoRA methods is their reliance on a discrete router, which prevents the integration of the MoE components into the backbone model. To overcome this, we propose FURINA, a novel Free from Unmergeable Router framework based on the LINear Aggregation of experts. FURINA eliminates the router by introducing a Self-Routing mechanism. This is achieved through three core innovations: (1) decoupled learning of the direction and magnitude for LoRA adapters, (2) a shared learnable magnitude vector for consistent activation scaling, and (3) an expert selection loss that encourages divergent expert activation. The proposed mechanism leverages the angular similarity between the input and each adapter's directional component to activate experts, which are then scaled by the shared magnitude vector. This design allows the output norm to naturally reflect the importance of each expert, thereby enabling dynamic, router-free routing. The expert selection loss further sharpens this behavior by encouraging sparsity and aligning it with standard MoE activation patterns. We also introduce a shared expert within the MoE-LoRA block that provides stable, foundational knowledge. To the best of our knowledge, FURINA is the first router-free, MoE-enhanced LoRA method that can be fully merged into the backbone model, introducing zero additional inference-time cost or complexity. Extensive experiments demonstrate that FURINA not only significantly outperforms standard LoRA but also matches or surpasses the performance of existing MoE-LoRA methods, while eliminating the extra inference-time overhead of MoE.
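The self-routing mechanism described in the abstract can be sketched in a few lines: each expert's activation comes from the cosine similarity between the input and that adapter's directional component, scaled by a shared magnitude vector. The function name, shapes, and normalization details below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def furina_self_route(x, directions, magnitudes):
    """Router-free expert weighting: the activation of each expert is the
    cosine similarity between input x and that expert's direction vector,
    scaled by a shared learnable magnitude vector (shapes are illustrative).
    x: (d,), directions: (n_experts, d), magnitudes: (n_experts,)
    """
    d_unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    x_unit = x / np.linalg.norm(x)
    similarities = d_unit @ x_unit      # angular similarity per expert
    return similarities * magnitudes    # expert activation strengths
```

Because no discrete top-k router is involved, the aggregation stays a smooth function of the input, which is what allows the adapters to be merged into the backbone after training.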
Related papers
- CoMoL: Efficient Mixture of LoRA Experts via Dynamic Core Space Merging [49.87105462292961]
Core Space Mixture of LoRA (CoMoL) is a novel MoE-LoRA framework that incorporates expert diversity, parameter efficiency, and fine-grained adaptation. CoMoL consistently outperforms existing methods across multiple tasks.
arXiv Detail & Related papers (2026-02-28T09:40:11Z) - L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts [49.90176890917986]
We propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions. Experiments on a large-scale language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing stability, expert specialization, and overall model performance.
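To make the summary concrete, here is a minimal sketch of routing in a shared low-rank space with a saturating score. The tanh is only a stand-in for the paper's SIPS; the exact scoring function and all parameter names are assumptions.

```python
import numpy as np

def low_rank_saturated_routing(x, W_down, expert_keys, alpha=1.0):
    """Sketch: project the token into a shared low-rank routing space, then
    score experts with a saturated inner product. The tanh is a stand-in
    for Lipschitz-controlled scoring (SIPS); its exact form is an assumption.
    x: (d,), W_down: (r, d), expert_keys: (n_experts, r)
    """
    z = W_down @ x                             # shared low-rank latent routing space
    return np.tanh(alpha * (expert_keys @ z))  # bounded, saturating scores
```

Bounding the score keeps small input perturbations from swinging expert assignments arbitrarily, which is the stability property the abstract highlights.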
arXiv Detail & Related papers (2026-01-29T07:18:33Z) - RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering [62.63376387138257]
We propose a plug-and-play intervention framework that adaptively steers large language model (LLM) reasoning in activation space. RISER constructs a library of reusable reasoning vectors and employs a lightweight Router to dynamically compose them for each input. The Router is optimized via reinforcement learning under task-level rewards, activating latent cognitive primitives in an emergent and compositional manner.
arXiv Detail & Related papers (2026-01-14T08:04:33Z) - Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution [76.66229730098759]
In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models. We propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert.
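The rank-as-expert idea can be sketched directly from the LoRA factorization: with factors A and B, each rank contributes an independent outer-product update scaled by its own gate. The gate computation (degradation-aware routing) is omitted here and all names are assumptions.

```python
import numpy as np

def rank_wise_delta(A, B, gates):
    """Treat each LoRA rank as one expert:
    delta_W = sum_i gates[i] * outer(B[:, i], A[i, :]).
    A: (r, d_in), B: (d_out, r), gates: (r,) gate values from a router
    (the degradation-aware gating itself is not sketched here).
    """
    return (B * gates) @ A  # scales column i of B by gates[i], then recombines
```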
arXiv Detail & Related papers (2025-11-20T04:11:44Z) - FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts [44.21416999726094]
Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for foundation models. MoE-based LoRA variants show promise in mitigating intra-task correlations in single-task instruction tuning. FlyLoRA is an implicit MoE-based LoRA variant that introduces rank-wise expert activation in the up-projection matrix.
arXiv Detail & Related papers (2025-10-09T16:17:13Z) - Exploring Sparse Adapters for Scalable Merging of Parameter Efficient Experts [72.22148263683037]
We study the properties of sparse adapters, which train only a subset of weights in the base neural network, as potential building blocks of modular architectures. First, we propose a simple method for training highly effective sparse adapters, which is conceptually simpler than existing methods in the literature. Next, we investigate the merging properties of these sparse adapters by merging adapters for up to 20 natural language processing tasks.
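A minimal sketch of why sparse adapters are attractive for merging: if each adapter updates only a small, largely disjoint subset of weights, merging can reduce to summing sparse deltas with few collisions. The dict representation and the additive merge rule below are assumptions for illustration, not the paper's exact procedure.

```python
def merge_sparse_adapters(deltas):
    """Merge sparse adapters by summing their sparse delta-weight maps.
    Each adapter is a dict {weight_index: delta_value}; overlapping
    indices accumulate (the merge rule is assumed additive here).
    """
    merged = {}
    for delta in deltas:
        for idx, value in delta.items():
            merged[idx] = merged.get(idx, 0.0) + value
    return merged
```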
arXiv Detail & Related papers (2025-07-09T03:25:45Z) - Little By Little: Continual Learning via Self-Activated Sparse Mixture-of-Rank Adaptive Learning [19.982853959240497]
Continual learning with large pre-trained models is challenged by catastrophic forgetting and task interference. Existing LoRA-based Mixture-of-Experts (MoE) approaches mitigate forgetting by assigning and freezing task-specific adapters. We propose MoRA, a Mixture-of-Rank Adaptive learning approach with self-activated and sparse rank activation for CL.
arXiv Detail & Related papers (2025-06-26T06:19:05Z) - LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing [17.171872354057694]
We propose the LoRA-Mixer, a modular and lightweight MoE framework that integrates LoRA experts. Our core innovation lies in replacing the projection matrices of the attention module's input/output linear layers with task-specific LoRA experts. LoRA-Mixer achieves significant improvements on datasets such as GSM8K, HumanEval, and MedQA.
arXiv Detail & Related papers (2025-06-17T14:58:54Z) - MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models [61.89384981175277]
We propose a heterogeneous Mixture-of-Adapters (MoA) approach to integrate Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE). Experimental results demonstrate that heterogeneous MoA outperforms homogeneous MoE-LoRA methods in both performance and parameter efficiency.
arXiv Detail & Related papers (2025-06-06T09:54:19Z) - DynMoLE: Boosting Mixture of LoRA Experts Fine-Tuning with a Hybrid Routing Mechanism [5.988126768890861]
DynMoLE is a hybrid routing strategy that dynamically adjusts expert selection based on the Tsallis entropy of the router's probability distribution. Our experiments on commonsense reasoning benchmarks demonstrate that DynMoLE achieves substantial performance improvements.
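The Tsallis entropy mentioned above has a standard closed form, S_q(p) = (1 - sum_i p_i^q) / (q - 1), which recovers Shannon entropy in the limit q -> 1. How DynMoLE thresholds this value to adjust expert selection is not reproduced here; the sketch only computes the entropy itself.

```python
import numpy as np

def tsallis_entropy(p, q=1.5):
    """Tsallis entropy of a probability vector p:
    S_q(p) = (1 - sum_i p_i**q) / (q - 1), for q != 1.
    A peaked router distribution gives low entropy (confident routing);
    a flat one gives high entropy (uncertain routing)."""
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)
```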
arXiv Detail & Related papers (2025-04-01T11:14:19Z) - Mixture of Routers [4.248666380057258]
We propose an efficient fine-tuning method called Mixture of Routers (MoR). MoR uses multiple sub-routers for joint selection and uses a learnable main router to determine the weights of the sub-routers. Results show that MoR outperforms baseline models on most tasks, achieving an average performance improvement of 1%.
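The two-level routing described above can be sketched as a weighted mix of sub-router distributions; the softmax placement and all shapes below are assumptions for illustration.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def mixture_of_routers(x, sub_routers, W_main):
    """Sketch: each sub-router scores the experts; a learnable main router
    weights the sub-routers, and the final scores are the weighted mixture.
    x: (d,), sub_routers: list of (n_experts, d) matrices, W_main: (k, d)."""
    sub_scores = np.stack([softmax(W @ x) for W in sub_routers])  # (k, n_experts)
    mix = softmax(W_main @ x)              # weights over the k sub-routers
    return mix @ sub_scores                # (n_experts,)
```

Since the output is a convex combination of probability vectors, it still sums to one, so it can be consumed like an ordinary router distribution.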
arXiv Detail & Related papers (2025-03-30T08:39:09Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - LoRA-IR: Taming Low-Rank Experts for Efficient All-in-One Image Restoration [62.3751291442432]
We propose LoRA-IR, a flexible framework that dynamically leverages compact low-rank experts to facilitate efficient all-in-one image restoration.
LoRA-IR consists of two training stages: degradation-guided pre-training and parameter-efficient fine-tuning.
Experiments demonstrate that LoRA-IR achieves SOTA performance across 14 IR tasks and 29 benchmarks, while maintaining computational efficiency.
arXiv Detail & Related papers (2024-10-20T13:00:24Z) - Mixture of LoRA Experts [87.50120181861362]
This paper introduces the Mixture of LoRA Experts (MoLE) approach, which harnesses hierarchical control and unfettered branch selection.
The MoLE approach achieves superior LoRA fusion performance in comparison to direct arithmetic merging.
arXiv Detail & Related papers (2024-04-21T11:59:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.