Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts
- URL: http://arxiv.org/abs/2510.07205v1
- Date: Wed, 08 Oct 2025 16:40:31 GMT
- Title: Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts
- Authors: Fangshuo Liao, Anastasios Kyrillidis
- Abstract summary: This paper advances MoE theory by providing convergence guarantees for joint training of soft-routed MoE models with non-linear routers and experts. We show that post-training pruning can effectively eliminate redundant neurons, followed by a provably convergent fine-tuning process that reaches global optimality.
- Score: 11.437368205968573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) architectures have emerged as a cornerstone of modern AI systems. In particular, MoEs route inputs dynamically to specialized experts whose outputs are aggregated through weighted summation. Despite their widespread application, theoretical understanding of MoE training dynamics remains limited to either separate expert-router optimization or only top-1 routing scenarios with carefully constructed datasets. This paper advances MoE theory by providing convergence guarantees for joint training of soft-routed MoE models with non-linear routers and experts in a student-teacher framework. We prove that, with moderate over-parameterization, the student network undergoes a feature learning phase in which the router's learning process is "guided" by the experts, recovering the teacher's parameters. Moreover, we show that post-training pruning can effectively eliminate redundant neurons, followed by a provably convergent fine-tuning process that reaches global optimality. To our knowledge, our analysis is the first to bring novel insights into the optimization landscape of the MoE architecture.
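To make the soft-routing setup in the abstract concrete, below is a minimal NumPy sketch of a soft-routed MoE layer in a student-teacher setting: a softmax router produces weights over experts, each expert is a non-linear map (ReLU here), and the layer output is the weighted sum of expert outputs. All names, dimensions, and the ReLU choice are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class SoftRoutedMoE:
    """Minimal soft-routed MoE layer:
    output(x) = sum_i softmax(router(x))_i * expert_i(x).
    Hypothetical sketch; the paper's exact parameterization may differ."""

    def __init__(self, d_in, d_out, n_experts, seed=0):
        rng = np.random.default_rng(seed)
        self.router_w = rng.standard_normal((d_in, n_experts)) / np.sqrt(d_in)
        self.expert_w = rng.standard_normal((n_experts, d_in, d_out)) / np.sqrt(d_in)

    def forward(self, x):
        # x: (batch, d_in)
        gates = softmax(x @ self.router_w)               # (batch, n_experts) soft routing weights
        expert_out = np.maximum(x @ self.expert_w, 0.0)  # non-linear experts: (n_experts, batch, d_out)
        # weighted summation of expert outputs
        return np.einsum("be,ebd->bd", gates, expert_out)

# Student-teacher framework: an over-parameterized student is trained
# to match a narrower teacher on the same inputs (only the loss is shown).
teacher = SoftRoutedMoE(d_in=8, d_out=4, n_experts=2, seed=1)
student = SoftRoutedMoE(d_in=8, d_out=4, n_experts=6, seed=2)  # moderate over-parameterization
x = np.random.default_rng(3).standard_normal((5, 8))
loss = np.mean((student.forward(x) - teacher.forward(x)) ** 2)
print(loss)
```

The weighted summation in `forward` is the "soft routing" referred to in the abstract: every expert contributes to every input, in contrast to top-1 routing where only a single expert is selected.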
Related papers
- Phase-Aware Mixture of Experts for Agentic Reinforcement Learning [23.18318273534301]
A plausible remedy could be employing the Mixture-of-Experts (MoE) architecture in the policy network. MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. We propose Phase-Aware Mixture of Experts (PA-MoE). It first features a lightweight phase router that learns latent phase boundaries directly from the RL objective without pre-defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise.
arXiv Detail & Related papers (2026-02-19T03:18:30Z) - ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns [68.61814799047956]
Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. We introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations.
arXiv Detail & Related papers (2026-02-17T11:50:58Z) - SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning [83.66308307152808]
We propose StAbilized Mixture-of-Experts (SAME) for Multimodal Continual Instruction Tuning (MCIT). SAME stabilizes expert selection by decomposing routing dynamics into subspaces and updating only task-relevant directions. It also introduces adaptive expert activation to freeze selected experts during training, reducing redundant and cross-task interference.
arXiv Detail & Related papers (2026-02-02T11:47:06Z) - ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts [25.46805026086543]
We describe ReXMoE, a novel MoE architecture that improves routing beyond the existing layer-local approaches. ReXMoE decouples expert dimensionality from per-layer budgets, enabling richer expert combinations without sacrificing individual expert capacity.
arXiv Detail & Related papers (2025-10-20T12:27:55Z) - Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms [55.1784306456972]
Mixture-of-Experts (MoE) architectures have emerged as a promising direction, offering efficiency and scalability by activating only a subset of parameters during inference. We use an internal metric to investigate the mechanisms of the MoE architecture by explicitly incorporating routing mechanisms and analyzing expert-level behaviors. We uncover several findings: (1) neuron utilization decreases as models evolve, reflecting stronger generalization; (2) training exhibits a dynamic trajectory, where benchmark performance alone provides limited signal; (3) task completion emerges from collaborative contributions of multiple experts, with shared experts driving concentration; and (4) activation patterns at the neuron level provide a fine-grained proxy for data diversity.
arXiv Detail & Related papers (2025-09-28T15:13:38Z) - On Linear Mode Connectivity of Mixture-of-Experts Architectures [1.6747713135100666]
We investigate the phenomenon of Linear Mode Connectivity (LMC) in neural networks. LMC is a notable phenomenon in the loss landscapes of neural networks, wherein independently trained models have been found to be connected, up to symmetries of the training algorithm.
arXiv Detail & Related papers (2025-09-14T16:51:41Z) - Dynamic Acoustic Model Architecture Optimization in Training for ASR [51.21112094223223]
DMAO is an architecture optimization framework that employs a grow-and-drop strategy to automatically reallocate parameters during training. We evaluate DMAO through experiments with CTC on the LibriSpeech, TED-LIUM-v2, and Switchboard datasets.
arXiv Detail & Related papers (2025-06-16T07:47:34Z) - A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models.
We theoretically prove that prioritizing the pruning of the experts with a smaller change in the router's l2 norm from the pretrained model guarantees the preservation of test accuracy (a toy sketch of this criterion appears after the related-papers list below).
Although our theoretical analysis is centered on binary classification tasks on a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z) - Soft Merging of Experts with Adaptive Routing [38.962451264172856]
We introduce Soft Merging of Experts with Adaptive Routing (SMEAR)
SMEAR avoids discrete routing by using a single "merged" expert constructed via a weighted average of all of the experts' parameters.
We empirically validate that models using SMEAR outperform models that route based on metadata or learn sparse routing through gradient estimation.
arXiv Detail & Related papers (2023-06-06T15:04:31Z) - Improving Expert Specialization in Mixture of Experts [0.7366405857677227]
Mixture of experts (MoE) is the simplest gated modular neural network architecture.
We show that the original MoE architecture and its training method do not guarantee intuitive task decompositions and good expert utilization.
We introduce a novel gating architecture, similar to attention, that improves performance and results in a lower entropy task decomposition.
arXiv Detail & Related papers (2023-02-28T16:16:45Z) - Towards Understanding Mixture of Experts in Deep Learning [95.27215939891511]
We study how the MoE layer improves the performance of neural network learning.
Our results suggest that the cluster structure of the underlying problem and the non-linearity of the expert are pivotal to the success of MoE.
arXiv Detail & Related papers (2022-08-04T17:59:10Z)
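The expert-pruning criterion summarized above (from "A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts") can be sketched in a few lines: rank experts by how far their router weights moved, in l2 norm, from the pretrained checkpoint, and prune the ones that moved least. This is a hypothetical illustration of that criterion only; the function name, array shapes, and surrounding toy data are assumptions.

```python
import numpy as np

def prune_by_router_shift(pretrained_router, finetuned_router, keep):
    """Keep the `keep` experts whose router weights changed most (in l2 norm)
    between pretraining and fine-tuning; mark the rest for pruning.

    pretrained_router, finetuned_router: arrays of shape (n_experts, d_in),
    one router weight vector per expert. Hypothetical interface."""
    shift = np.linalg.norm(finetuned_router - pretrained_router, axis=1)  # per-expert l2 change
    order = np.argsort(shift)
    prune_idx = order[:-keep]   # smallest shifts are pruned first
    keep_idx = order[-keep:]    # largest shifts are retained
    return np.sort(keep_idx), np.sort(prune_idx)

# Toy example: 8 experts with 16-dimensional router weights; later experts
# are perturbed more, so they should be the ones kept.
rng = np.random.default_rng(0)
w_pre = rng.standard_normal((8, 16))
w_ft = w_pre + rng.standard_normal((8, 16)) * np.linspace(0.01, 1.0, 8)[:, None]
kept, pruned = prune_by_router_shift(w_pre, w_ft, keep=4)
print("kept experts:", kept, "pruned experts:", pruned)
```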