When More Experts Hurt: Underfitting in Multi-Expert Learning to Defer
- URL: http://arxiv.org/abs/2602.17144v1
- Date: Thu, 19 Feb 2026 07:45:18 GMT
- Title: When More Experts Hurt: Underfitting in Multi-Expert Learning to Defer
- Authors: Shuqi Liu, Yuzhou Cao, Lei Feng, Bo An, Luke Ong
- Abstract summary: We show that multi-expert L2D is fundamentally more challenging than the single-expert case. We propose PiCCE, a surrogate-based method that adaptively identifies a reliable expert based on empirical evidence.
- Score: 28.815942679585273
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning to Defer (L2D) enables a classifier to abstain from predictions and defer to an expert, and has recently been extended to multi-expert settings. In this work, we show that multi-expert L2D is fundamentally more challenging than the single-expert case. With multiple experts, the classifier's underfitting becomes inherent and seriously degrades prediction performance, whereas in the single-expert setting it arises only under specific conditions. We theoretically show that this stems from an intrinsic expert-identifiability issue: learning which expert to trust from a diverse pool, a problem that is absent in the single-expert case and that renders existing underfitting remedies ineffective. To tackle this issue, we propose PiCCE (Pick the Confident and Correct Expert), a surrogate-based method that adaptively identifies a reliable expert based on empirical evidence. PiCCE effectively reduces multi-expert L2D to a single-expert-like learning problem, thereby resolving multi-expert underfitting. We further prove its statistical consistency and its ability to recover class probabilities and expert accuracies. Extensive experiments across diverse settings, including real-world expert scenarios, validate our theoretical results and demonstrate improved performance.
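The deferral setup the abstract describes can be made concrete with a small sketch. The code below is not PiCCE itself but a minimal NumPy illustration of the standard softmax surrogate for multi-expert L2D that such methods build on: the model outputs K class logits plus one deferral logit per expert, and the surrogate loss rewards either predicting the true class or deferring to an expert that is correct on the example. The function names and the exact loss form are illustrative assumptions, not the paper's method.

```python
import numpy as np

def multi_expert_l2d_surrogate(logits, y, expert_correct):
    """Softmax surrogate for multi-expert learning to defer (illustrative).

    logits:         (K + E,) scores -- K class logits followed by E deferral
                    logits, one per expert.
    y:              true class index in [0, K).
    expert_correct: (E,) 0/1 indicators of whether each expert is correct
                    on this example.

    Loss = -log p[y] - sum_e 1[expert e correct] * log p[K + e],
    i.e. cross-entropy that credits predicting the true class or deferring
    to any expert that happens to be right on this example.
    """
    z = logits - logits.max()            # shift for numerical stability
    log_p = z - np.log(np.exp(z).sum())  # log-softmax over K + E outputs
    K = len(logits) - len(expert_correct)
    loss = -log_p[y]
    for e, correct in enumerate(expert_correct):
        if correct:
            loss -= log_p[K + e]
    return loss

def decide(logits, K):
    """At test time: argmax over K + E outputs -> predict a class or defer."""
    j = int(np.argmax(logits))
    return ("predict", j) if j < K else ("defer", j - K)
```

The paper's reported difficulty lives in the sum over experts: with many deferral outputs competing against the class outputs, the classifier head can be starved of gradient (the underfitting the abstract describes), which is what PiCCE's expert selection is said to address.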
Related papers
- SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning [83.66308307152808]
We propose StAbilized Mixture-of-Experts (SAME) for Multimodal Continual Instruction Tuning (MCIT). SAME stabilizes expert selection by decomposing routing dynamics into subspaces and updating only task-relevant directions. It also introduces adaptive expert activation to freeze selected experts during training, reducing redundancy and cross-task interference.
arXiv Detail & Related papers (2026-02-02T11:47:06Z)
- Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs [49.72591739116668]
Reinforcement Learning with Verifiable Rewards (RLVR) has become a widely adopted technique for enhancing the reasoning ability of Large Language Models (LLMs). Existing methods address this by imitating expert trajectories, which improves effectiveness but neglects diversity. We propose MENTOR: Mixed-policy Expert Navigation for Token-level Optimization of Reasoning.
arXiv Detail & Related papers (2025-10-05T10:38:55Z)
- Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations [48.890534958441016]
This study investigates domain specialization and expert redundancy in large-scale MoE models. We propose a simple yet effective pruning framework, EASY-EP, to identify and retain only the most relevant experts. Experiments on DeepSeek-R1 and DeepSeek-V3-0324 show that our method achieves comparable performance and $2.99\times$ throughput under the same memory budget as the full model while keeping only half the experts.
arXiv Detail & Related papers (2025-04-09T11:34:06Z)
- Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations [86.90549830760513]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise in scaling up the learning capacity of neural networks. We propose the MoE Experts Compression Suite (MC-Suite) as a benchmark for estimating expert importance from diverse perspectives. We also present an experimentally validated conjecture that, during expert dropping, SMoEs' instruction-following capabilities are predominantly hurt.
arXiv Detail & Related papers (2025-04-08T00:49:08Z)
- Convergence Rates for Softmax Gating Mixture of Experts [78.3687645289918]
Mixture of experts (MoE) has emerged as an effective framework for advancing the efficiency and scalability of machine learning models. Central to the success of MoE is an adaptive softmax gating mechanism that determines the relevance of each expert to a given input and dynamically assigns each expert its weight. We perform a convergence analysis of parameter estimation and expert estimation under MoE equipped with standard softmax gating or its variants, including dense-to-sparse gating and hierarchical softmax gating.
arXiv Detail & Related papers (2025-03-05T06:11:24Z)
- Expert-Agnostic Learning to Defer [4.171294900540735]
We introduce Expert-Agnostic Learning to Defer (EA-L2D), a novel L2D framework that employs a Bayesian approach to model expert behaviour. EA-L2D significantly outperforms prior methods on unseen experts, achieving up to a 28% relative improvement.
arXiv Detail & Related papers (2025-02-14T19:59:25Z)
- Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts [44.09546603624385]
We introduce a notion of expert specialization for Soft MoE. We show that when there are many small experts, the architecture is implicitly biased in a fashion that allows us to efficiently approximate the specialized expert subset.
arXiv Detail & Related papers (2024-09-02T00:39:00Z)
- HoME: Hierarchy of Multi-Gate Experts for Multi-Task Learning at Kuaishou [19.113649341888532]
We present the practical problems and the lessons learned at Kuaishou's short-video services. In industry, a widely used multi-task framework is the Mixture-of-Experts (MoE) paradigm.
arXiv Detail & Related papers (2024-08-10T04:25:48Z)
- Learning More Generalized Experts by Merging Experts in Mixture-of-Experts [0.5221459608786241]
We show that incorporating a shared layer in a mixture-of-experts can lead to performance degradation. We merge the two most frequently selected experts and update the least frequently selected expert using the combination of experts. Our algorithm enhances transfer learning and mitigates catastrophic forgetting when applied to multi-domain task incremental learning.
arXiv Detail & Related papers (2024-05-19T11:55:48Z)
- Inverse Reinforcement Learning with Sub-optimal Experts [56.553106680769474]
We study the theoretical properties of the class of reward functions that are compatible with a given set of experts. Our results show that the presence of multiple sub-optimal experts can significantly shrink the set of compatible rewards. We analyze a uniform sampling algorithm that is minimax optimal whenever the sub-optimal experts' performance is sufficiently close to that of the optimal agent.
arXiv Detail & Related papers (2024-01-08T12:39:25Z)
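Several of the related papers above (the convergence-rates and pruning entries in particular) revolve around softmax gating in mixture-of-experts. As a self-contained sketch of that standard mechanism, assuming nothing from any specific paper: a gate scores each expert for an input, softmax turns the scores into weights, and an optional top-k mask gives the dense-to-sparse variant. Function names and shapes here are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def softmax_gated_moe(x, W_gate, experts, top_k=None):
    """Dense or sparse (top-k) softmax-gated mixture of experts.

    x:       (d,) input vector.
    W_gate:  (E, d) gating matrix; scores = W_gate @ x.
    experts: list of E callables, each mapping (d,) -> (m,).
    top_k:   if set, keep only the k largest gate weights (renormalized),
             giving the dense-to-sparse behaviour mentioned above.
    """
    scores = W_gate @ x
    w = softmax(scores)
    if top_k is not None:
        keep = np.argsort(w)[-top_k:]     # indices of the k largest weights
        mask = np.zeros_like(w)
        mask[keep] = w[keep]
        w = mask / mask.sum()             # renormalize surviving weights
    # Output is the gate-weighted combination of expert outputs.
    return sum(w_e * f(x) for w_e, f in zip(w, experts))
```

With `top_k=1` this degenerates to routing each input to its single highest-scoring expert, which is the regime the expert-dropping and pruning papers above probe.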
This list is automatically generated from the titles and abstracts of the papers on this site.