Related papers: Selective Sinkhorn Routing for Improved Sparse Mixture of Experts

URL: http://arxiv.org/abs/2511.08972v1
Date: Thu, 13 Nov 2025 01:22:34 GMT
Title: Selective Sinkhorn Routing for Improved Sparse Mixture of Experts
Authors: Duc Anh Nguyen, Huu Binh Ta, Nhuan Le Duc, Tan M. Nguyen, Toan Tran
Abstract summary: Sparse Mixture-of-Experts (SMoE) has gained prominence as a scalable and computationally efficient architecture. Existing SMoE models often rely on auxiliary losses and additional trainable parameters to encourage expert diversity. We propose Selective Sinkhorn Routing (SSR), a routing mechanism that replaces auxiliary loss with lightweight Sinkhorn-based routing.
Score: 6.598611263174362
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparse Mixture-of-Experts (SMoE) has gained prominence as a scalable and computationally efficient architecture, enabling significant growth in model capacity without incurring additional inference costs. However, existing SMoE models often rely on auxiliary losses (e.g., z-loss, load balancing) and additional trainable parameters (e.g., noisy gating) to encourage expert diversity, leading to objective misalignment and increased model complexity. Moreover, existing Sinkhorn-based methods suffer from significant training overhead due to their heavy reliance on the computationally expensive Sinkhorn algorithm. In this work, we formulate token-to-expert assignment as an optimal transport problem, incorporating constraints to ensure balanced expert utilization. We demonstrate that introducing a minimal degree of optimal transport-based routing enhances SMoE performance without requiring auxiliary balancing losses. Unlike previous methods, our approach derives gating scores directly from the transport map, enabling more effective token-to-expert balancing, supported by both theoretical analysis and empirical results. Building on these insights, we propose Selective Sinkhorn Routing (SSR), a routing mechanism that replaces auxiliary loss with lightweight Sinkhorn-based routing. SSR promotes balanced token assignments while preserving flexibility in expert selection. Across both language modeling and image classification tasks, SSR achieves faster training, higher accuracy, and greater robustness to input corruption.
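The abstract describes token-to-expert assignment as an optimal transport problem with gating scores read off the transport map. As a rough illustration of that general idea only, the sketch below runs plain Sinkhorn iterations on a matrix of router logits and then takes top-k gates from the resulting plan. It is not the SSR algorithm from the paper: the function names `sinkhorn_routing` and `top_k_gates`, the temperature `eps`, the iteration count, and the uniform marginals are all assumptions made for this example.

```python
import numpy as np

def sinkhorn_routing(logits, n_iters=50, eps=1.0):
    """Approximate balanced transport plan between tokens and experts.

    logits: array of shape (n_tokens, n_experts) with router scores.
    Returns a plan whose column sums equal 1/n_experts and whose row
    sums approach 1/n_tokens as the iterations converge.
    All hyperparameters here are illustrative assumptions.
    """
    n_tokens, n_experts = logits.shape
    # Gibbs kernel; subtracting the max keeps the exponentials bounded.
    kernel = np.exp((logits - logits.max()) / eps)
    row_mass = np.full(n_tokens, 1.0 / n_tokens)    # equal mass per token
    col_mass = np.full(n_experts, 1.0 / n_experts)  # equal load per expert
    u = np.ones(n_tokens)
    v = np.ones(n_experts)
    for _ in range(n_iters):
        u = row_mass / (kernel @ v)    # rescale rows toward row_mass
        v = col_mass / (kernel.T @ u)  # rescale columns toward col_mass
    return u[:, None] * kernel * v[None, :]

def top_k_gates(plan, k=2):
    """Pick the k largest plan entries per token and renormalize them."""
    idx = np.argsort(-plan, axis=1)[:, :k]
    scores = np.take_along_axis(plan, idx, axis=1)
    gates = scores / scores.sum(axis=1, keepdims=True)
    return idx, gates

rng = np.random.default_rng(0)
demo_logits = rng.normal(size=(16, 4))  # 16 tokens, 4 experts (made-up sizes)
demo_plan = sinkhorn_routing(demo_logits)
demo_idx, demo_gates = top_k_gates(demo_plan, k=2)
```

Because the last update in each iteration rescales the columns, every expert receives the same total mass in the returned plan, which is the balancing effect that an auxiliary load-balancing loss would otherwise have to encourage indirectly.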

Related papers

A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs [64.8510381475827]
Sparse Mixture-of-Experts (SMoE) architectures are increasingly used to scale large language models efficiently. SMoE models often suffer from severe load imbalance across experts, where a small subset of experts receives most tokens while others are underutilized. We present a systematic analysis of expert routing during inference and identify three findings: (i) load imbalance persists and worsens with larger batch sizes, (ii) selection frequency does not reliably reflect expert importance, and (iii) overall expert workload and importance can be estimated using a small calibration set.
arXiv Detail & Related papers (2026-02-23T15:11:16Z)
Towards Efficient Constraint Handling in Neural Solvers for Routing Problems [53.35866378109893]
We present Construct-and-Refine (CaR), the first general and efficient constraint-handling framework for neural routing solvers. CaR achieves superior feasibility, solution quality, and efficiency compared to both classical and neural state-of-the-art solvers.
arXiv Detail & Related papers (2026-02-17T21:06:23Z)
Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization [68.89915707647138]
Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains. We propose CoSMo (Split-Merge Optimization), a framework designed to eliminate structural redundancy rather than indiscriminately restricting token volume.
arXiv Detail & Related papers (2026-02-03T05:54:28Z)
Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts [74.40169987564724]
Expert parallelism (EP) is designed to scale MoE models by distributing experts across multiple devices. Under extreme imbalance, EP can funnel a disproportionate number of tokens to a small number of experts, leading to compute- and memory-bound failures. We propose Least-Loaded Expert Parallelism (LLEP), a novel EP algorithm that dynamically reroutes excess tokens and associated expert parameters from overloaded devices to underutilized ones.
arXiv Detail & Related papers (2026-01-23T18:19:15Z)
Enhanced-FQL($λ$), an Efficient and Interpretable RL with novel Fuzzy Eligibility Traces and Segmented Experience Replay [0.0]
Enhanced-FQL($λ$) achieves superior sample efficiency and reduced variance compared to n-step fuzzy TD and fuzzyA($λ$) baselines. The framework's inherent interpretability, combined with its computational efficiency and theoretical convergence guarantees, makes it suitable for safety-critical applications.
arXiv Detail & Related papers (2026-01-07T20:59:18Z)
A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models [3.0247776995428945]
In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure.
arXiv Detail & Related papers (2025-12-03T16:00:02Z)
ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization [13.182475975397251]
ERMoE is a sparse MoE transformer that replaces learned gating logits with an "Eigenbasis Score". We show that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks. A 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by more than 7% and yields interpretable expert specializations.
arXiv Detail & Related papers (2025-11-14T05:31:37Z)
Dual-granularity Sinkhorn Distillation for Enhanced Learning from Long-tailed Noisy Data [67.25796812343454]
Real-world datasets for deep learning frequently suffer from the co-occurring challenges of class imbalance and label noise. We propose Dual-granularity Sinkhorn Distillation (D-SINK), a novel framework that enhances dual robustness by distilling and integrating complementary insights. Experiments on benchmark datasets demonstrate that D-SINK significantly improves robustness and achieves strong empirical performance in learning from long-tailed noisy data.
arXiv Detail & Related papers (2025-10-09T13:05:27Z)
PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce Sycophancy Mitigation through Adaptive Reasoning Trajectories (SMART). We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z)
Load Balancing Mixture of Experts with Similarity Preserving Routers [30.279616888339543]
Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks. We introduce a novel load balancing loss that preserves token-wise relational structure. Our results show that applying our loss to the router results in 36% faster convergence and lower redundancy.
arXiv Detail & Related papers (2025-06-16T22:22:59Z)
Advancing Expert Specialization for Better MoE [22.88847592702946]
Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. We observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing. We propose a simple yet effective solution that introduces two complementary objectives.
arXiv Detail & Related papers (2025-05-28T13:09:47Z)
Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z)
Efficient Single Image Super-Resolution with Entropy Attention and Receptive Field Augmentation [34.50541063621832]
We present an efficient single image super-resolution (SISR) model composed of a novel entropy attention (EA) and a shifting large kernel attention (SLKA). EA increases the entropy of intermediate features conditioned on a Gaussian distribution, providing more informative input for subsequent reasoning. SLKA extends the receptive field of SR models with the assistance of channel shifting, which also helps boost the diversity of hierarchical features.
arXiv Detail & Related papers (2024-08-08T02:03:10Z)
End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures. We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.