Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs
- URL: http://arxiv.org/abs/2510.04140v1
- Date: Sun, 05 Oct 2025 10:38:55 GMT
- Title: Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs
- Authors: Zishang Jiang, Jinyi Han, Tingyun Li, Xinyi Wang, Sihang Jiang, Jiaqing Liang, Zhaoqian Dai, Shuguang Ma, Fei Yu, Yanghua Xiao
- Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has become a widely adopted technique for enhancing the reasoning ability of Large Language Models (LLMs). Existing methods address this issue by imitating expert trajectories, which improves effectiveness but neglects diversity. We propose MENTOR: Mixed-policy Expert Navigation for Token-level Optimization of Reasoning.
- Score: 49.72591739116668
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a widely adopted technique for enhancing the reasoning ability of Large Language Models (LLMs). However, the effectiveness of RLVR strongly depends on the capability of the base model, because RLVR requires the model to perform high-quality exploration, which involves both effectiveness and diversity. Unfortunately, existing methods address this issue by imitating expert trajectories, which improves effectiveness but neglects diversity. To address this, we argue that the expert needs to provide guidance only at critical decision points rather than along the entire reasoning path. Based on this insight, we propose MENTOR: Mixed-policy Expert Navigation for Token-level Optimization of Reasoning, a framework that provides expert guidance only at critical decision points to enable effective and diverse exploration in RLVR. Extensive experiments show that MENTOR enables models to capture the essence of expert strategies rather than surface imitation, thereby performing high-quality exploration and achieving superior overall performance. Our code is available online.
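The listing gives no implementation details, but the core idea (expert guidance only at critical decision points) can be sketched. The sketch below assumes critical points are identified by high entropy of the student's next-token distribution; that criterion, the threshold, and all names are illustrative assumptions, not the paper's actual algorithm:

```python
import math

def select_next_token(student_probs, expert_probs, entropy_threshold=2.0):
    """Mixed-policy decoding sketch: defer to the expert distribution
    only at high-entropy (uncertain) decision points; elsewhere keep
    the student's own distribution so exploration stays diverse."""
    entropy = -sum(p * math.log(p) for p in student_probs.values() if p > 0)
    is_critical = entropy > entropy_threshold
    guiding = expert_probs if is_critical else student_probs
    # Greedy pick for illustration; RLVR training would sample instead.
    return max(guiding, key=guiding.get), is_critical
```

A confident student (low entropy) keeps its own choice, while a near-uniform distribution over many candidates crosses the threshold and hands the decision to the expert.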
Related papers
- Going Beyond Expert Performance via Deep Implicit Imitation Reinforcement Learning [3.691573844585973]
We introduce a deep implicit imitation reinforcement learning framework that combines deep reinforcement learning with implicit imitation learning from observation-only datasets.<n>Our main algorithm, Deep Implicit Q-Network (DIIQN), employs an action inference mechanism that reconstructs expert actions through online exploration.<n>We further extend our framework with a Heterogeneous Actions DIIQN (HA-DIIQN) algorithm to tackle scenarios where expert and agent possess different action sets.
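The abstract only names DIIQN's action inference mechanism; one minimal sketch, assuming a learned transition model is available, recovers an expert action from observation-only data by matching predicted next states against the observed transition (all names here are hypothetical):

```python
def infer_expert_action(state, expert_next_state, actions, transition_model):
    """Action-inference sketch: pick the action whose predicted next
    state best matches the observed expert transition, recovering
    actions from state-only expert demonstrations."""
    def squared_error(action):
        predicted = transition_model(state, action)
        return sum((p - e) ** 2 for p, e in zip(predicted, expert_next_state))
    return min(actions, key=squared_error)
```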
arXiv Detail & Related papers (2025-11-05T16:33:39Z)
- More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration [103.1589018460702]
The "guidance-on-demand" approach expands exploration while preserving the value of self-discovery. Experiments show AMPO substantially outperforms a strong baseline. Using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher.
arXiv Detail & Related papers (2025-10-02T17:14:00Z)
- CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models [85.315711639214]
We introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. For the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses.
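The two curiosity signals named in the abstract can be sketched numerically; the exact scaling and how the bonuses enter the RL objective are assumptions here:

```python
import math

def curiosity_bonuses(token_logprobs, value_heads):
    """Sketch of CDE-style intrinsic bonuses: the actor bonus is the
    perplexity of the generated response (exp of mean negative
    log-likelihood per token); the critic bonus is the variance of
    value estimates across the multi-head critic."""
    actor_bonus = math.exp(-sum(token_logprobs) / len(token_logprobs))
    mean_value = sum(value_heads) / len(value_heads)
    critic_bonus = sum((v - mean_value) ** 2 for v in value_heads) / len(value_heads)
    return actor_bonus, critic_bonus
```

An overconfident response has low perplexity and earns little actor bonus, while disagreement among value heads yields a large critic bonus, marking states worth exploring.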
arXiv Detail & Related papers (2025-09-11T17:59:17Z)
- Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning [56.496001894673235]
Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs). Our analysis reveals that puzzling phenomena like "aha moments", "length-scaling", and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy.
arXiv Detail & Related papers (2025-09-03T18:52:49Z)
- MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement [37.880962254812175]
Multi-Expert Mutual Learning GRPO is an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses. We show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama.
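The multi-expert sampling idea pairs naturally with GRPO's group-relative advantages; the sketch below assumes one response per expert system prompt and a scalar reward function, with `generate` and `score` as hypothetical interfaces:

```python
def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each sampled
    response's reward by the group's mean and standard deviation."""
    mean_r = sum(rewards) / len(rewards)
    variance = sum((r - mean_r) ** 2 for r in rewards) / len(rewards)
    std = variance ** 0.5
    return [(r - mean_r) / (std + 1e-8) for r in rewards]

def meml_grpo_group(question, expert_prompts, generate, score):
    """MEML-GRPO-style sketch: sample one response under each expert
    system prompt so the group covers diverse strategies, then compute
    group-relative advantages over the pooled responses."""
    responses = [generate(prompt, question) for prompt in expert_prompts]
    rewards = [score(question, resp) for resp in responses]
    return list(zip(responses, grpo_advantages(rewards)))
```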
arXiv Detail & Related papers (2025-08-13T09:58:10Z)
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR [92.51110344832178]
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). This technical report presents a systematic investigation of exploration capacities in RLVR, covering four main aspects.
arXiv Detail & Related papers (2025-08-11T01:26:16Z)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [86.30192066451256]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
- AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification [4.722248376235009]
Skill-based reinforcement learning (SBRL) enables rapid adaptation in environments with sparse rewards by pretraining a skill-conditioned policy. We propose a new method, Adaptive Multi-objective Projection for balancing Exploration and skill Diversification (AMPED). Our approach achieves performance that surpasses SBRL baselines across various benchmarks.
arXiv Detail & Related papers (2025-06-06T10:59:39Z)
- Preference-Guided Reinforcement Learning for Efficient Exploration [14.058764537783086]
We introduce LOPE (Learning Online with trajectory Preference guidancE), an end-to-end preference-guided RL framework. Our intuition is that LOPE directly adjusts the focus of online exploration by considering human feedback as guidance. LOPE outperforms several state-of-the-art methods in terms of convergence rate and overall performance.
arXiv Detail & Related papers (2024-07-09T02:11:12Z) - Soft Expert Reward Learning for Vision-and-Language Navigation [94.86954695912125]
Vision-and-Language Navigation (VLN) requires an agent to find a specified spot in an unseen environment by following natural language instructions.
We introduce a Soft Expert Reward Learning (SERL) model to overcome the reward engineering and generalisation problems of the VLN task.
arXiv Detail & Related papers (2020-07-21T14:17:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.