Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts
- URL: http://arxiv.org/abs/2510.23027v1
- Date: Mon, 27 Oct 2025 05:47:48 GMT
- Title: Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts
- Authors: Di Zhang, Xun Wu, Shaohan Huang, Yaru Hao, Li Dong, Zewen Chi, Zhifang Sui, Furu Wei
- Abstract summary: We propose a novel router-aware approach to optimize importance sampling weights in off-policy reinforcement learning (RL). Specifically, we design a rescaling strategy guided by router logits, which effectively reduces gradient variance and mitigates training divergence. Experimental results demonstrate that our method significantly improves both the convergence stability and the final performance of MoE models.
- Score: 113.0656076371565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in reinforcement learning (RL) have substantially improved the training of large-scale language models, leading to significant gains in generation quality and reasoning ability. However, most existing research focuses on dense models, while RL training for Mixture-of-Experts (MoE) architectures remains underexplored. To address the instability commonly observed in MoE training, we propose a novel router-aware approach to optimize importance sampling (IS) weights in off-policy RL. Specifically, we design a rescaling strategy guided by router logits, which effectively reduces gradient variance and mitigates training divergence. Experimental results demonstrate that our method significantly improves both the convergence stability and the final performance of MoE models, highlighting the potential of RL algorithmic innovations tailored to MoE architectures and providing a promising direction for efficient training of large-scale expert models.
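The abstract does not spell out the rescaling rule, but the general shape of a router-aware importance-sampling (IS) correction can be sketched. Below is a minimal, hypothetical PyTorch sketch: the per-token IS ratio is damped by the KL divergence between the behavior policy's and current policy's router distributions, then clipped PPO-style. The drift signal, the exponential damping form, and all names and parameters are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of router-aware IS rescaling for off-policy RL on MoE.
# The damping rule below is an assumption; the paper's exact rule is not given.
import torch
import torch.nn.functional as F

def router_aware_is_weights(
    logp_new: torch.Tensor,           # (T,) token log-probs, current policy
    logp_old: torch.Tensor,           # (T,) token log-probs, behavior policy
    router_logits_new: torch.Tensor,  # (T, E) router logits, current policy
    router_logits_old: torch.Tensor,  # (T, E) router logits, behavior policy
    clip_eps: float = 0.2,            # PPO-style clipping range
    beta: float = 1.0,                # hypothetical damping strength
) -> torch.Tensor:
    # Standard token-level IS ratio pi_new / pi_old.
    ratio = torch.exp(logp_new - logp_old)

    # Hypothetical router-drift signal: KL(router_old || router_new) per token.
    # Large drift means the token's weight is dominated by routing change.
    p_old = F.softmax(router_logits_old, dim=-1)
    logp_r_new = F.log_softmax(router_logits_new, dim=-1)
    logp_r_old = F.log_softmax(router_logits_old, dim=-1)
    router_drift = (p_old * (logp_r_old - logp_r_new)).sum(-1)  # (T,)

    # Shrink the IS ratio as drift grows, then clip to bound each token's
    # gradient contribution, in the spirit of reducing gradient variance.
    rescaled = ratio * torch.exp(-beta * router_drift)
    return rescaled.clamp(1.0 - clip_eps, 1.0 + clip_eps)
```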
Related papers
- Enhancing Diffusion-based Restoration Models via Difficulty-Adaptive Reinforcement Learning with IQA Reward [93.04811239892852]
Reinforcement Learning (RL) has recently been incorporated into diffusion models. In this paper, we investigate how to effectively integrate RL into diffusion-based restoration models.
arXiv Detail & Related papers (2025-11-03T14:57:57Z)
- Adversarial Diffusion for Robust Reinforcement Learning [46.44328012099217]
We introduce Adversarial Diffusion for Robust Reinforcement Learning (AD-RRL). AD-RRL guides the diffusion process to generate worst-case trajectories during training, effectively optimizing the Conditional Value at Risk (CVaR) of the cumulative return. Empirical results across standard benchmarks show that AD-RRL achieves superior robustness and performance compared to existing robust RL methods.
arXiv Detail & Related papers (2025-09-28T12:34:35Z)
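For reference, the CVaR objective that AD-RRL (the entry above) optimizes has a simple sample estimate: the mean of the worst alpha-fraction of episode returns. A minimal sketch with illustrative numbers; the adversarial diffusion machinery itself is not reproduced here.

```python
# Sample estimate of CVaR_alpha: average the k lowest returns.
import torch

def cvar_of_returns(returns: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """CVaR_alpha = E[G | G <= VaR_alpha], estimated from sampled returns."""
    k = max(1, int(alpha * returns.numel()))           # size of the worst tail
    worst, _ = torch.topk(returns, k, largest=False)   # k lowest returns
    return worst.mean()

# Example: 1000 sampled returns; CVaR_0.1 averages the 100 lowest.
g = torch.randn(1000) * 10 + 50
print(cvar_of_returns(g, alpha=0.1))
```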
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
- Robust Evolutionary Multi-Objective Network Architecture Search for Reinforcement Learning (EMNAS-RL) [43.108040967674185]
This paper introduces Evolutionary Multi-Objective Network Architecture Search (EMNAS) for the first time to optimize neural network architectures in large-scale Reinforcement Learning for Autonomous Driving (AD). EMNAS uses genetic algorithms to automate network design, tailored to enhance rewards and reduce model size without compromising performance.
arXiv Detail & Related papers (2025-06-10T07:52:35Z)
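The genetic-algorithm loop behind EMNAS-style search (entry above) can be sketched in a few lines. The toy below evolves MLP layer widths against two objectives, task reward and parameter count, keeping a Pareto front each generation. The reward function is a stand-in for the expensive RL evaluation EMNAS actually performs, and all constants are illustrative.

```python
# Toy multi-objective evolutionary architecture search (maximize reward,
# minimize parameter count). Not the EMNAS implementation.
import random

def param_count(widths):               # proxy for model size (MLP layers)
    dims = [64] + list(widths) + [8]
    return sum(a * b for a, b in zip(dims, dims[1:]))

def fake_reward(widths):               # stand-in for RL training + evaluation
    return sum(widths) ** 0.5 - 0.01 * len(widths)

def mutate(widths):                    # perturb one layer width
    w = list(widths)
    i = random.randrange(len(w))
    w[i] = max(16, w[i] + random.choice([-16, 16]))
    return tuple(w)

def dominates(a, b):                   # Pareto dominance on (reward, params)
    return a[0] >= b[0] and a[1] <= b[1] and a != b

population = [tuple(random.choice([32, 64, 128]) for _ in range(3))
              for _ in range(12)]
for gen in range(20):
    scored = [(fake_reward(w), param_count(w), w) for w in population]
    # Keep the non-dominated front, then refill the population by mutation.
    front = [s for s in scored
             if not any(dominates((o[0], o[1]), (s[0], s[1])) for o in scored)]
    parents = [w for _, _, w in front]
    population = parents + [mutate(random.choice(parents))
                            for _ in range(12 - len(parents))]
print(sorted(front, reverse=True)[0])  # best (reward, params, widths) found
```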
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- Model Merging in Pre-training of Large Language Models [39.413435498849445]
We present a comprehensive investigation of model merging techniques during the pre-training process. We show that merging checkpoints trained with constant learning rates achieves significant performance improvements. We offer the open-source community practical pre-training guidelines for effective model merging.
arXiv Detail & Related papers (2025-05-17T16:53:14Z)
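Checkpoint merging (entry above) reduces, in its simplest form, to a weighted average of parameters. A minimal sketch, assuming same-architecture checkpoints and uniform weights; the paper's actual weighting recipe is not reproduced here.

```python
# Merge model checkpoints by (weighted) parameter averaging.
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Average a list of state_dicts from the same architecture."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        # .float() is an assumption for simplicity; integer buffers would
        # need special handling in a real pipeline.
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Example usage (hypothetical checkpoints from one pre-training run):
# model.load_state_dict(merge_state_dicts([ckpt1, ckpt2, ckpt3]))
```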
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models. We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z)
- On the Diminishing Returns of Complex Robust RAG Training in the Era of Powerful LLMs [85.688901949146]
We investigate the question: does the benefit of complex robust training methods diminish as language models become more powerful? Our analysis reveals a consistent trend: the marginal robustness benefit of sophisticated training strategies decreases substantially as model capacity increases. Further investigation demonstrates that stronger models naturally exhibit better confidence calibration, cross-dataset generalization capability, and more effective attention patterns, even under simple training regimes.
arXiv Detail & Related papers (2025-02-17T03:34:31Z)
- Mixture of Experts in a Mixture of RL settings [15.124698782503248]
We show that MoEs can boost Deep Reinforcement Learning (DRL) performance by expanding the network's parameter count while reducing dormant neurons.
We shed more light on MoEs' ability to deal with non-stationarity and investigate MoEs in DRL settings with "amplified" non-stationarity via multi-task training.
arXiv Detail & Related papers (2024-06-26T15:15:15Z)
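The MoE building block studied in the entry above is a learned router plus a pool of expert MLPs with top-k selection: each input activates only k experts, whose outputs are mixed by renormalized router probabilities. A minimal sketch with illustrative sizes, not taken from the paper.

```python
# Minimal top-k MoE layer of the kind dropped into DRL networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                               # x: (batch, d_model)
        gate = F.softmax(self.router(x), dim=-1)        # routing probabilities
        top_p, top_i = gate.topk(self.k, dim=-1)        # (batch, k)
        top_p = top_p / top_p.sum(-1, keepdim=True)     # renormalize over top-k
        out = torch.zeros_like(x)
        for slot in range(self.k):                      # mix the chosen experts
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e
                if mask.any():
                    out[mask] += top_p[mask, slot, None] * expert(x[mask])
        return out

# Example: replace a dense torso layer in a DRL agent with TopKMoE(64).
```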
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.