Maximum Score Routing For Mixture-of-Experts
- URL: http://arxiv.org/abs/2508.12801v1
- Date: Mon, 18 Aug 2025 10:25:42 GMT
- Title: Maximum Score Routing For Mixture-of-Experts
- Authors: Bowen Dong, Yilong Fan, Yutao Sun, Zhenyu Li, Tengyu Pan, Xun Zhou, Jianyong Wang,
- Abstract summary: MaxScore is a novel MoE routing paradigm that models routing as a minimum-cost maximum-flow problem and integrates a SoftTopk operator.<n>MaxScore achieves lower training losses and higher evaluation scores at equivalent FLOPs compared to both constrained and unconstrained baselines.
- Score: 22.063226685681034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Routing networks in sparsely activated mixture-of-experts (MoE) dynamically allocate input tokens to top-k experts through differentiable sparse transformations, enabling scalable model capacity while preserving computational efficiency. Traditional MoE networks impose an expert capacity constraint to ensure GPU-friendly computation. However, this leads to token dropping when capacity is saturated and results in low hardware efficiency due to padding in underutilized experts. Removing the capacity constraint, in turn, compromises load balancing and computational efficiency. To address these issues, we propose Maximum Score Routing ($\mathbf{MaxScore}$), a novel MoE routing paradigm that models routing as a minimum-cost maximum-flow problem and integrates a SoftTopk operator. MaxScore resolves the fundamental limitations of iterative rerouting and optimal transport formulations, achieving lower training losses and higher evaluation scores at equivalent FLOPs compared to both constrained and unconstrained baselines. Implementation details and experimental configurations can be obtained from $\href{https://github.com/dongbw18/MaxScore.git}{MaxScore}$.
Related papers
- MC#: Mixture Compressor for Mixture-of-Experts Large Models [86.64315380917827]
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation.<n>We propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning.
arXiv Detail & Related papers (2025-10-13T03:12:46Z) - PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment.<n>We propose PT$2$-LLM, a post-training ternarization framework tailored for LLMs.<n>At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z) - Cutting Slack: Quantum Optimization with Slack-Free Methods for Combinatorial Benchmarks [4.266376725904727]
Constraint handling remains a key bottleneck in quantum optimization.<n>We investigate a suite of Lagrangian-based optimization techniques for solving constrained problems on quantum simulators and hardware.<n>Our results highlight the flexibility of Lagrangian formulations as a scalable alternative to QUBO penalization.
arXiv Detail & Related papers (2025-07-16T11:39:47Z) - D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving [14.607254882119507]
Combination of experts (MoE) model is a sparse variant of large language models (LLMs)<n>Despite its benefits, MoE is still too expensive to deploy on resource-constrained edge devices.<n>We propose D$2$MoE, an algorithm-system co-design framework that matches diverse task requirements by dynamically allocating the most proper bit-width to each expert.
arXiv Detail & Related papers (2025-04-17T05:37:35Z) - QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth which adopts post-training quantization to quantize MDE models with hardware accelerations for ASICs.<n>Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost.<n>We design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability.
arXiv Detail & Related papers (2025-03-20T21:03:10Z) - Accelerating MoE Model Inference with Expert Sharding [1.4733737463429546]
Mixture of experts (MoE) models achieve state-of-the-art results in language modeling but suffer from inefficient hardware utilization due to imbalanced token routing and communication overhead.<n>We introduce MoEShard, an inference system that achieves perfect load balancing through tensor sharding of MoE experts.
arXiv Detail & Related papers (2025-03-11T14:15:01Z) - DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks.<n>We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge.<n>Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z) - MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing [0.6445605125467574]
Mixture-of-Experts (MoE) model architecture has emerged as a promising solution for scaling transformer models efficiently.<n>MoE models need to be distributed across GPU devices, thus face critical performance bottlenecks.<n>We propose an optimal expert-to- GPU assignment that minimizes token routing costs and token processing balances across devices.
arXiv Detail & Related papers (2025-02-10T16:34:36Z) - Sparse Backpropagation for MoE Training [118.31785160874024]
We introduce SparseMixer, a scalable gradient estimator that bridges the gap between backpropagation and sparse expert routing.
Grounded in a numerical ODE framework, SparseMixer harnesses the mid-point method, a second-order ODE solver, to deliver precise gradient approximations.
Applying SparseMixer to Switch Transformer on both pre-training and machine translation tasks, SparseMixer showcases considerable performance gain.
arXiv Detail & Related papers (2023-10-01T22:43:57Z) - AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z) - Taming GANs with Lookahead-Minmax [63.90038365274479]
Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead-minmax with Adam or extragradient.
Using 30-fold fewer parameters and 16-fold smaller minibatches we outperform the reported performance of the class-dependent BigGAN on CIFAR-10 by obtaining FID of 12.19 without using the class labels.
arXiv Detail & Related papers (2020-06-25T17:13:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.