Related papers: MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement

URL: http://arxiv.org/abs/2508.09670v1
Date: Wed, 13 Aug 2025 09:58:10 GMT
Title: MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement
Authors: Weitao Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, An-Lan Wang, Weijie Yin, Dingkang Yang, Yuxiang Nie, Bin Shan, Hao Feng, Irene Li, Kun Yang, Han Wang, Jingqun Tang, Teng Fu, Changhong Jin, Chao Feng, Xiaohui Lv, Can Huang
Abstract summary: Multi-Expert Mutual Learning GRPO is an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses. We show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama.
Score: 37.880962254812175
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses, substantially increasing the likelihood of identifying correct solutions. Additionally, we introduce an inter-expert mutual learning mechanism that facilitates knowledge sharing and transfer among experts, further boosting the model's performance through RLVR. Extensive experiments across multiple reasoning benchmarks show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama, effectively overcoming the core limitations of traditional RLVR methods.
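
The reward-sparsity problem named in the abstract can be illustrated with a minimal sketch of GRPO-style group-normalized advantages. This is not the authors' implementation: the function name `group_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, and the expert-prompt sampling and mutual learning steps are not modeled because the abstract does not specify them.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Assumption: population standard deviation and a small eps for stability;
    real implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every sampled answer is wrong, so every reward is zero and every
# advantage is zero: the policy gradient carries no learning signal.
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])

# Hypothetical group in which one sample is correct. The correct sample
# gets a positive advantage and the incorrect ones a negative advantage.
one_correct = group_advantages([0.0, 0.0, 0.0, 1.0])

print(all_wrong)
print(one_correct)
```

Under these assumptions, a group with no correct answer contributes nothing to the update, which is why a broader range of responses, and hence a higher chance of at least one correct answer per group, matters.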

Related papers

DARL: Encouraging Diverse Answers for General Reasoning without Verifiers [41.35516261603945]
We propose DARL, a reinforcement learning framework that encourages the generation of diverse answers within a controlled deviation range from the reference. Our framework is fully compatible with existing general reinforcement learning methods and can be seamlessly integrated without additional verifiers.
arXiv Detail & Related papers (2026-01-21T06:23:55Z)
Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs [49.72591739116668]
Reinforcement Learning with Verifiable Rewards (RLVR) has become a widely adopted technique for enhancing the reasoning ability of Large Language Models (LLMs). Existing methods address this issue by imitating expert trajectories, which improve effectiveness but neglect diversity. We propose MENTOR: Mixed-policy Expert Navigation for Token-level Optimization of Reasoning.
arXiv Detail & Related papers (2025-10-05T10:38:55Z)
More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration [103.1589018460702]
A "guidance-on-demand" approach expands exploration while preserving the value of self-discovery. Experiments show AMPO substantially outperforms a strong baseline. Using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher.
arXiv Detail & Related papers (2025-10-02T17:14:00Z)
Risk-Sensitive RL for Alleviating Exploration Dilemmas in Large Language Models [22.50153462109328]
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs). We introduce a Risk-Sensitive Reinforcement Learning framework. Our approach employs a risk-seeking objective that interpolates between mean and maximum rewards, leading to a novel algorithm. Remarkably, RS-GRPO is simple to implement, requiring only minor code modifications.
arXiv Detail & Related papers (2025-09-29T04:12:20Z)
CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models [85.315711639214]
We introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. For the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses.
arXiv Detail & Related papers (2025-09-11T17:59:17Z)
From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR [92.51110344832178]
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). This technical report presents a systematic investigation of exploration capacities in RLVR, covering four main aspects.
arXiv Detail & Related papers (2025-08-11T01:26:16Z)
RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [86.30192066451256]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism [10.288667305064065]
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. LLMs remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have aimed to enhance models' search and reasoning capabilities.
arXiv Detail & Related papers (2025-06-30T09:02:45Z)
Enhancing Rating-Based Reinforcement Learning to Effectively Leverage Feedback from Large Vision-Language Models [22.10168313140081]
We introduce ERL-VLM, an enhanced rating-based reinforcement learning method that learns reward functions from AI feedback. ERL-VLM queries large vision-language models for absolute ratings of individual trajectories, enabling more expressive feedback. We demonstrate that ERL-VLM significantly outperforms existing VLM-based reward generation methods.
arXiv Detail & Related papers (2025-06-15T12:05:08Z)
Advancing Multimodal Reasoning Capabilities of Multimodal Large Language Models via Visual Perception Reward [87.06604760273372]
We propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive the visual content accurately. We show that Perception-R1 achieves state-of-the-art performance on most benchmarks using only 1,442 training data.
arXiv Detail & Related papers (2025-06-08T16:48:42Z)
CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG [53.950029990391066]
We propose Cross-source knowledge Reconciliation for Multimodal RAG (CoRe-MMRAG), a novel end-to-end framework that effectively reconciles inconsistencies across knowledge sources. Experiments on KB-VQA benchmarks show that CoRe-MMRAG achieves substantial improvements over baseline methods.
arXiv Detail & Related papers (2025-06-03T07:32:40Z)
Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs). We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education. We utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications.
arXiv Detail & Related papers (2025-03-31T08:22:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.