Router Upcycling: Leveraging Mixture-of-Routers in Mixture-of-Experts Upcycling
- URL: http://arxiv.org/abs/2509.00679v1
- Date: Sun, 31 Aug 2025 03:22:54 GMT
- Title: Router Upcycling: Leveraging Mixture-of-Routers in Mixture-of-Experts Upcycling
- Authors: Junfeng Ran, Guangxiang Zhao, Yuhan Wu, Dawei Zhu, Longyun Wu, Yikai Zhao, Tong Yang, Lin Sun, Xiangzheng Zhang, Sujian Li
- Abstract summary: We propose a novel routing technique called Router Upcycling to enhance the performance of MoE upcycling models. Our method achieves state-of-the-art (SOTA) performance, outperforming other upcycling baselines.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Mixture-of-Experts (MoE) models have gained significant attention in deep learning due to their dynamic resource allocation and superior performance across diverse tasks. However, efficiently training these models remains challenging. The MoE upcycling technique has been proposed to reuse and improve existing model components, thereby minimizing training overhead. Despite this, simple routers, such as linear routers, often struggle with complex routing tasks within MoE upcycling. In response, we propose a novel routing technique called Router Upcycling to enhance the performance of MoE upcycling models. Our approach initializes multiple routers from the attention heads of preceding attention layers during upcycling. These routers collaboratively assign tokens to specialized experts in an attention-like manner. Each token is processed into diverse queries and aligned with the experts' features (serving as keys). Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance, outperforming other upcycling baselines.
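The attention-like routing described in the abstract can be sketched minimally in NumPy. The shapes, names, and the averaging used to combine router scores below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_like_routing(x, query_projs, expert_keys, top_k=2):
    """Route one token using multiple attention-like routers.

    x:           (d,) token hidden state
    query_projs: (R, d, d_k) per-router query projections (hypothetically
                 initialized from the heads of a preceding attention layer)
    expert_keys: (E, d_k) one feature vector per expert, serving as keys
    """
    d_k = query_projs.shape[-1]
    queries = np.einsum("d,rdk->rk", x, query_projs)  # (R, d_k): diverse queries
    scores = queries @ expert_keys.T / np.sqrt(d_k)   # (R, E): scaled dot products
    probs = softmax(scores, axis=-1).mean(axis=0)     # (E,): routers collaborate
    top = np.argsort(probs)[-top_k:][::-1]            # indices of top-k experts
    weights = probs[top] / probs[top].sum()           # renormalized gate weights
    return top, weights
```

Averaging the per-router distributions is only one plausible way for the routers to "collaboratively assign tokens"; a learned combination over routers would fit the same description.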
Related papers
- Trade-offs in Ensembling, Merging and Routing Among Parameter-Efficient Experts [56.02203242609604]
Large language models (LLMs) fine-tuned with lightweight adapters achieve strong performance across diverse tasks. Fusing independently trained models with different strengths has shown promise for multi-task learning through three main strategies. We empirically evaluate their trade-offs, addressing two key questions: What are the advantages of going beyond uniform ensembling or merging? And does the flexibility of routing justify its complexity?
arXiv Detail & Related papers (2026-03-03T21:44:11Z)
- xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning [104.63494870852894]
We present xRouter, a tool-calling-based routing system in which a learned router can either answer directly or invoke one or more external models. Our implementation encompasses the full reinforcement learning framework, including reward and cost accounting. Across diverse benchmarks, xRouter achieves strong cost-performance trade-offs.
arXiv Detail & Related papers (2025-10-09T16:52:01Z)
- Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts [11.437368205968573]
This paper advances MoE theory by providing convergence guarantees for joint training of soft-routed MoE models with non-linear routers and experts. We show that a post-training pruning can effectively eliminate redundant neurons, followed by a provably convergent fine-tuning process that reaches global optimality.
arXiv Detail & Related papers (2025-10-08T16:40:31Z)
- MoTE: Mixture of Task-specific Experts for Pre-Trained Model-Based Class-incremental Learning [39.892628170627496]
Class-incremental learning (CIL) requires deep learning models to continuously acquire new knowledge from streaming data. Prompt-based approaches suffer from prompt overwriting, while adapter-based methods face challenges such as dimensional misalignment between tasks. We propose a mixture of task-specific experts (MoTE) framework that effectively mitigates the miscalibration caused by inconsistent output dimensions.
arXiv Detail & Related papers (2025-05-21T03:06:10Z)
- Mixture of Routers [16.169900017745327]
We propose an efficient fine-tuning method called Mixture of Routers (MoR). MoR uses multiple sub-routers for joint selection and uses a learnable main router to determine the weights of the sub-routers. Results show that MoR outperforms baseline models on most tasks, achieving an average performance improvement of 1%.
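The sub-router/main-router scheme summarized above can be sketched as follows; all shapes and names are illustrative assumptions, not MoR's exact design:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_of_routers(x, sub_routers, main_router, top_k=2):
    """Joint expert selection by sub-routers, weighted by a main router.

    x:           (d,) token hidden state
    sub_routers: (R, E, d) one linear router per sub-router
    main_router: (R, d) maps the token to per-sub-router mixing weights
    """
    sub_probs = softmax(np.einsum("red,d->re", sub_routers, x), axis=-1)  # (R, E)
    mix = softmax(main_router @ x)   # (R,): learned trust in each sub-router
    probs = mix @ sub_probs          # (E,): jointly weighted distribution
    top = np.argsort(probs)[-top_k:][::-1]
    return top, probs[top] / probs[top].sum()
```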
arXiv Detail & Related papers (2025-03-30T08:39:09Z)
- ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing [28.73697327316267]
Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. We propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing. ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity.
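The contrast between conventional TopK+Softmax routing and a ReLU gate can be sketched as below; this is an illustrative reduction of the idea, not ReMoE's full architecture:

```python
import numpy as np

def topk_softmax_gate(logits, k=2):
    """Conventional gate: keep the top-k logits and softmax over them.
    The hard top-k selection is a non-differentiable step."""
    idx = np.argsort(logits)[-k:]
    w = np.exp(logits[idx] - logits[idx].max())
    gates = np.zeros_like(logits)
    gates[idx] = w / w.sum()
    return gates

def relu_gate(logits):
    """ReLU-style gate (sketch): positive logits become gate weights,
    so the number of active experts is data-dependent and no discrete
    top-k selection is needed."""
    return np.maximum(logits, 0.0)

logits = np.array([-1.2, 0.3, 2.0, -0.5, 0.8])
print(topk_softmax_gate(logits))  # exactly k=2 nonzero entries
print(relu_gate(logits))          # nonzero wherever the logit is positive
```

In practice a ReLU gate needs an auxiliary mechanism to keep the average number of active experts near a target budget; this sketch omits that regularization.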
arXiv Detail & Related papers (2024-12-19T10:21:20Z)
- Glider: Global and Local Instruction-Driven Expert Router [83.785832410832]
"Model MoErging" methods prioritize generalization to unseen tasks at the expense of performance on held-in tasks.
We propose Global and Local Instruction Driven Expert Router (GLIDER) that integrates a multi-scale routing mechanism.
GLIDER achieves substantially improved held-in performance while maintaining strong generalization on held-out tasks.
arXiv Detail & Related papers (2024-10-09T17:59:14Z)
- RouterRetriever: Routing over a Mixture of Expert Embedding Models [58.987116118425995]
We introduce RouterRetriever, a retrieval model that leverages a mixture of domain-specific experts by using a routing mechanism. RouterRetriever is the first work to demonstrate the advantages of routing over a mixture of domain-specific expert embedding models.
arXiv Detail & Related papers (2024-09-04T13:16:55Z)
- MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts [38.15244333975921]
MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training.
Our method outperforms previous dominant Mixture-of-Experts models in terms of both perplexity (PPL) and downstream task performance.
arXiv Detail & Related papers (2024-07-13T09:22:33Z)
- StableMoE: Stable Routing Strategy for Mixture of Experts [109.0602120199226]
Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead.
We propose StableMoE with two training stages to address the routing fluctuation problem.
Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
arXiv Detail & Related papers (2022-04-18T16:48:19Z)
- Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search [60.965024145243596]
One-shot weight sharing methods have recently drawn great attention in neural architecture search due to high efficiency and competitive performance.
To alleviate this problem, we present a simple yet effective architecture distillation method.
We introduce the concept of prioritized path, which refers to the architecture candidates exhibiting superior performance during training.
Since the prioritized paths are changed on the fly depending on their performance and complexity, the final obtained paths are the cream of the crop.
arXiv Detail & Related papers (2020-10-29T17:55:05Z)
- Forgetful Experience Replay in Hierarchical Reinforcement Learning from Demonstrations [55.41644538483948]
In this paper, we propose a combination of approaches that allow the agent to use low-quality demonstrations in complex vision-based environments.
Our proposed goal-oriented structuring of replay buffer allows the agent to automatically highlight sub-goals for solving complex hierarchical tasks in demonstrations.
The solution based on our algorithm beats all the solutions for the famous MineRL competition and allows the agent to mine a diamond in the Minecraft environment.
arXiv Detail & Related papers (2020-06-17T15:38:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.