HyperRouter: Towards Efficient Training and Inference of Sparse Mixture
of Experts
- URL: http://arxiv.org/abs/2312.07035v1
- Date: Tue, 12 Dec 2023 07:40:23 GMT
- Title: HyperRouter: Towards Efficient Training and Inference of Sparse Mixture
of Experts
- Authors: Giang Do, Khiem Le, Quang Pham, TrungTin Nguyen, Thanh-Nam Doan, Binh
T. Nguyen, Chenghao Liu, Savitha Ramasamy, Xiaoli Li, Steven Hoi
- Abstract summary: This work introduces HyperRouter, which dynamically generates the router's parameters through a fixed hypernetwork and trainable embeddings.
Experiments across a wide range of tasks demonstrate the superior performance and efficiency gains of HyperRouter.
- Score: 34.08858035082419
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: By routing input tokens to only a few split experts, Sparse
Mixture-of-Experts has enabled efficient training of large language models.
Recent findings suggest that fixing the routers can achieve competitive
performance by alleviating the collapsing problem, where all experts eventually
learn similar representations. However, this strategy has two key limitations:
(i) the policy derived from random routers might be sub-optimal, and (ii) it
requires extensive resources during training and evaluation, leading to limited
efficiency gains. This work introduces HyperRouter, which dynamically generates
the router's parameters through a fixed hypernetwork and trainable embeddings
to achieve a balance between training the routers and freezing them to learn an
improved routing policy. Extensive experiments across a wide range of tasks
demonstrate the superior performance and efficiency gains of HyperRouter
compared to existing routing methods. Our implementation is publicly available
at https://github.com/giangdip2410/HyperRouter.
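To make the abstract's mechanism concrete, here is a minimal numpy sketch of hypernetwork-generated routing: a frozen hypernetwork maps a small trainable embedding to the router's weight matrix, which then performs standard top-k expert selection. All names, dimensions, and the random initialization are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, d_emb = 16, 4, 8  # toy sizes, not the paper's

# Fixed (frozen) hypernetwork: its parameters are drawn once and never
# updated during training.
W_hyper = rng.standard_normal((d_emb, d_model * n_experts)) / np.sqrt(d_emb)

# Trainable per-layer embedding: the only router-side parameters that
# would receive gradients in this scheme.
layer_emb = rng.standard_normal(d_emb)

def router_weights(emb):
    """Generate the router's weight matrix from the embedding."""
    return (emb @ W_hyper).reshape(d_model, n_experts)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route(tokens, emb, k=2):
    """Top-k routing: per-token expert indices and renormalized gates."""
    logits = tokens @ router_weights(emb)       # (n_tokens, n_experts)
    probs = softmax(logits, axis=-1)
    topk = np.argsort(-probs, axis=-1)[:, :k]   # chosen expert indices
    gates = np.take_along_axis(probs, topk, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)
    return topk, gates

tokens = rng.standard_normal((5, d_model))
experts, gates = route(tokens, layer_emb)
print(experts.shape, gates.shape)  # (5, 2) (5, 2)
```

Since the hypernetwork is frozen, updating only the low-dimensional embedding constrains how far the routing policy can drift, which is one way to read the abstract's "balance between training the routers and freezing them".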
Related papers
- Trade-offs in Ensembling, Merging and Routing Among Parameter-Efficient Experts [56.02203242609604]
Large language models (LLMs) fine-tuned with lightweight adapters achieve strong performance across diverse tasks. Fusing independently trained models with different strengths has shown promise for multi-task learning through three main strategies. We empirically evaluate their trade-offs, addressing two key questions: What are the advantages of going beyond uniform ensembling or merging? And does the flexibility of routing justify its complexity?
arXiv Detail & Related papers (2026-03-03T21:44:11Z) - SkillOrchestra: Learning to Route Agents via Skill Transfer [65.50924963973286]
We introduce SkillOrchestra, a framework for skill-aware orchestration. SkillOrchestra learns fine-grained skills from execution experience and models agent-specific competence and cost under those skills. At deployment, the orchestrator infers the skill demands of the current interaction and selects agents that best satisfy them under an explicit performance-cost trade-off.
arXiv Detail & Related papers (2026-02-23T10:17:25Z) - When Routing Collapses: On the Degenerate Convergence of LLM Routers [46.01380774114097]
As the user's cost budget increases, routers systematically default to the most capable and most expensive model. We propose Equi, a decision-aware router that directly learns model rankings. On RouterBench, Equi reduces cost by about 17% at GPT-4-level performance compared to the strongest prior router.
arXiv Detail & Related papers (2026-02-03T12:51:55Z) - TCAndon-Router: Adaptive Reasoning Router for Multi-Agent Collaboration [0.9564467981235256]
Multi-Agent Systems (MAS) have become a powerful paradigm for building high-performance intelligent applications. Within these systems, the router responsible for determining which expert agents should handle a given query plays a crucial role in overall performance. To address these challenges, we propose TCAndon-TCAR: an adaptive reasoning router for multi-agent collaboration. Experiments on public datasets and real enterprise data demonstrate that TCAR significantly improves routing accuracy, reduces routing conflicts, and remains robust in ambiguous scenarios.
arXiv Detail & Related papers (2026-01-08T03:17:33Z) - ProxRouter: Proximity-Weighted LLM Query Routing for Improved Robustness to Outliers [14.831117443453165]
Large language model (LLM) query routers are critical to modern AI platforms. We propose ProxRouter, which applies an exponentially tilted aggregation mechanism to balance bias and variance in nonparametric routers.
arXiv Detail & Related papers (2025-10-10T20:28:14Z) - Router Upcycling: Leveraging Mixture-of-Routers in Mixture-of-Experts Upcycling [26.191204823414427]
We propose a novel routing technique called Router Upcycling to enhance the performance of MoE upcycling models. Our method achieves state-of-the-art (SOTA) performance, outperforming other upcycling baselines.
arXiv Detail & Related papers (2025-08-31T03:22:54Z) - Load Balancing Mixture of Experts with Similarity Preserving Routers [37.348178220494226]
Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks. We introduce a novel load balancing loss that preserves token-wise relational structure. Our results show that applying our loss to the router results in 36% faster convergence and lower redundancy.
arXiv Detail & Related papers (2025-06-16T22:22:59Z) - Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers [40.40923861822689]
Mixture of Depths (MoD) was introduced to dynamically adjust the computational depth by skipping less important layers.
Despite its promise, current MoD approaches remain under-explored and face two main challenges.
We propose Router-Tuning, a method that fine-tunes only the router on a small dataset, drastically reducing the computational overhead associated with full model training.
For the second challenge, we propose MindSkip, which deploys Attention with Dynamic Depths. This method preserves the model's performance while significantly enhancing computational and memory efficiency.
arXiv Detail & Related papers (2024-10-17T03:23:50Z) - Learning Sub-Second Routing Optimization in Computer Networks requires Packet-Level Dynamics [15.018408728324887]
Reinforcement Learning can help to learn network representations that provide routing decisions.
We present PackeRL, the first packet-level Reinforcement Learning environment for routing in generic network topologies.
We also introduce two new algorithms for learning sub-second Routing Optimization.
arXiv Detail & Related papers (2024-10-14T11:03:46Z) - RouterDC: Query-Based Router by Dual Contrastive Learning for Assembling Large Language Models [24.113223576205932]
We show that query-based Router by Dual Contrastive learning (RouterDC) is effective in assembling large language models (LLMs).
RouterDC largely outperforms individual top-performing LLMs as well as existing routing methods on both in-distribution and out-of-distribution tasks.
arXiv Detail & Related papers (2024-09-30T02:31:40Z) - XRoute Environment: A Novel Reinforcement Learning Environment for
Routing [8.797544401458476]
We introduce the XRoute Environment, a new reinforcement learning environment.
Agents are trained to select and route nets in an advanced, end-to-end routing framework.
The resulting environment is challenging yet easy to use, customize, and extend with additional scenarios.
arXiv Detail & Related papers (2023-05-23T08:46:25Z) - SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing [47.11171833082974]
We introduce SMILE, which exploits heterogeneous network bandwidth and splits a single-step routing into bi-level routing.
Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
arXiv Detail & Related papers (2022-12-10T03:44:16Z) - Multi-Head Adapter Routing for Cross-Task Generalization [56.75667096355806]
Polytropon learns an inventory of adapters and a routing function that selects a subset of adapters for each task during both pre-training and few-shot adaptation.
We find that routing is most beneficial during multi-task pre-training rather than during few-shot adaptation.
arXiv Detail & Related papers (2022-11-07T19:35:55Z) - On the Representation Collapse of Sparse Mixture of Experts [102.83396489230375]
Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead.
It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations.
However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse.
arXiv Detail & Related papers (2022-04-20T01:40:19Z) - StableMoE: Stable Routing Strategy for Mixture of Experts [109.0602120199226]
Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead.
We propose StableMoE with two training stages to address the routing fluctuation problem.
Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
arXiv Detail & Related papers (2022-04-18T16:48:19Z) - Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural
Architecture Search [60.965024145243596]
One-shot weight sharing methods have recently drawn great attention in neural architecture search due to high efficiency and competitive performance.
To alleviate this problem, we present a simple yet effective architecture distillation method.
We introduce the concept of prioritized path, which refers to the architecture candidates exhibiting superior performance during training.
Since the prioritized paths are changed on the fly depending on their performance and complexity, the final obtained paths are the cream of the crop.
arXiv Detail & Related papers (2020-10-29T17:55:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.