EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance
- URL: http://arxiv.org/abs/2509.23730v1
- Date: Sun, 28 Sep 2025 08:20:22 GMT
- Title: EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance
- Authors: Siyao Song, Cong Ma, Zhihao Cheng, Shiye Lei, Minghao Li, Ying Zeng, Huaixiao Tou, Kai Jia,
- Abstract summary: Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards.<n>We propose Expert-Assisted Policy Optimization (EAPO), a novel RL framework that enhances exploration by incorporating multi-turn interactions with external experts during training.<n>EAPO incentivizes the policy to adaptively determine when and how to consult experts, yielding richer reward signals and more reliable reasoning trajectories.
- Score: 19.21616215817727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards. Existing methods primarily rely on outcome-based supervision to strengthen internal LLM reasoning, often leading to inefficient exploration and sparse rewards. To mitigate this issue, we propose Expert-Assisted Policy Optimization (EAPO), a novel RL framework that enhances exploration by incorporating multi-turn interactions with external experts during training. Unlike prior methods, where policies reason in isolation, EAPO incentivizes the policy to adaptively determine when and how to consult experts, yielding richer reward signals and more reliable reasoning trajectories. External assistance ultimately internalizes expert knowledge into the policy model, amplifying the model's inherent reasoning capabilities. During evaluation, the policy model has been well-optimized to solve questions independently, producing improved reasoning paths and more accurate solutions. Experiments on mathematical reasoning benchmarks, including AIME 2024, AIME 2025, and AIMO 2025, show that EAPO consistently outperforms expert-assisted workflow, expert-distilled models, and RL baselines, with an average gain of 5 points over self-exploratory models.
Related papers
- Provable and Practical In-Context Policy Optimization for Self-Improvement [49.670847804409874]
We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference.<n>We introduce In-Context Policy Optimization (ICPO), in which an agent optimize its response in context using self-assessed or externally observed rewards without modifying its parameters.<n>We propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively uses its response and self-assessed reward to refine its response in-context at inference time.
arXiv Detail & Related papers (2026-03-02T00:21:50Z) - Think Outside the Policy: In-Context Steered Policy Optimization [13.24687763539952]
In-context Steered Policy Optimization provides expert guidance using existing datasets.<n>ICPO consistently enhances reinforcement learning performance and training stability on mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-10-30T14:14:15Z) - Think before Recommendation: Autonomous Reasoning-enhanced Recommender [25.883091131835172]
RecZero is a reinforcement learning-based recommendation paradigm that abandons the traditional multi-model and multi-stage distillation approach.<n>The paper explores a hybrid paradigm, RecOne, which combines supervised fine-tuning with RL, initializing the model with cold-start reasoning samples and further optimizing it with RL.
arXiv Detail & Related papers (2025-10-27T07:26:32Z) - On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm.<n>OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration.<n>Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z) - Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities [45.989423626537985]
Reinforcement learning (RL) has emerged as an effective method for training reasoning models.<n>We propose TAPO, a framework that augments RL by incorporating external high-level guidance ("thought patterns")<n>Our approach significantly outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva Math.
arXiv Detail & Related papers (2025-05-21T16:06:10Z) - Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training [86.70255651945602]
We introduce a novel inference-time steering methodology called Reinforcing Cognitive Experts (RICE)<n>RICE aims to improve reasoning performance without additional training or complexs.<n> Empirical evaluations with leading MoE-based LRMs demonstrate noticeable and consistent improvements in reasoning accuracy, cognitive efficiency, and cross-domain generalization.
arXiv Detail & Related papers (2025-05-20T17:59:16Z) - A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities.<n>Their alignment with human values remains critical for ensuring helpful and harmless deployments.<n>Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative.
arXiv Detail & Related papers (2025-03-12T08:45:15Z) - Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization [22.67700436936984]
We introduce Direct Advantage Policy Optimization (DAPO), a novel step-level offline reinforcement learning algorithm.<n>DAPO employs a critic function to predict the reasoning accuracy at each step, thereby generating dense signals to refine the generation strategy.<n>Our results show that DAPO can effectively enhance the mathematical and code capabilities on both SFT models and RL models, demonstrating the effectiveness of DAPO.
arXiv Detail & Related papers (2024-12-24T08:39:35Z) - From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process.<n>We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z) - Secrets of RLHF in Large Language Models Part I: PPO [81.01936993929127]
Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence.
reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit.
In this report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising PPO algorithms impact policy agent training.
arXiv Detail & Related papers (2023-07-11T01:55:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.