Fugu-MT Paper Translation (Abstract): EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance

Paper overview: EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance

arxiv url: http://arxiv.org/abs/2509.23730v1
Date: Sun, 28 Sep 2025 08:20:22 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.406494
Title: EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance
Title (reference translation): EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance
Authors: Siyao Song, Cong Ma, Zhihao Cheng, Shiye Lei, Minghao Li, Ying Zeng, Huaixiao Tou, Kai Jia
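Abstract summary: Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards. This paper proposes EAPO, a new RL framework that incorporates multi-turn interactions with external experts. EAPO encourages the policy to adaptively decide when and how to consult experts, yielding richer reward signals and more reliable reasoning trajectories.
Reference score (independently computed attention metric): 19.21616215817727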
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards. Existing methods primarily rely on outcome-based supervision to strengthen internal LLM reasoning, often leading to inefficient exploration and sparse rewards. To mitigate this issue, we propose Expert-Assisted Policy Optimization (EAPO), a novel RL framework that enhances exploration by incorporating multi-turn interactions with external experts during training. Unlike prior methods, where policies reason in isolation, EAPO incentivizes the policy to adaptively determine when and how to consult experts, yielding richer reward signals and more reliable reasoning trajectories. External assistance ultimately internalizes expert knowledge into the policy model, amplifying the model's inherent reasoning capabilities. During evaluation, the policy model has been well-optimized to solve questions independently, producing improved reasoning paths and more accurate solutions. Experiments on mathematical reasoning benchmarks, including AIME 2024, AIME 2025, and AIMO 2025, show that EAPO consistently outperforms expert-assisted workflow, expert-distilled models, and RL baselines, with an average gain of 5 points over self-exploratory models.
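Abstract (reference translation): Large language models (LLMs) have recently made progress in reasoning when optimized with reinforcement learning (RL) under verifiable rewards. Existing methods mainly rely on outcome-based supervision to strengthen internal LLM reasoning, which often leads to inefficient exploration and sparse rewards. To mitigate this problem, the authors propose Expert-Assisted Policy Optimization (EAPO), a new RL framework that incorporates multi-turn interactions with external experts during training. Unlike prior methods, in which the policy reasons in isolation, EAPO incentivizes the policy to adaptively decide when and how to consult experts, yielding richer reward signals and more reliable reasoning trajectories. External assistance ultimately internalizes expert knowledge into the policy model and amplifies the model's inherent reasoning capabilities. At evaluation time, the well-optimized policy model solves questions independently, producing better reasoning paths and more accurate solutions. Experiments on mathematical reasoning benchmarks, including AIME 2024, AIME 2025, and AIMO 2025, show that EAPO consistently outperforms expert-assisted workflows, expert-distilled models, and RL baselines, with an average gain of 5 points over self-exploratory models.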
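
The abstract describes a rollout in which the policy may consult an external expert during training, is scored by a verifiable outcome reward, and answers on its own at evaluation. The toy sketch below illustrates only that control flow. It is not the paper's algorithm: the names `rollout`, `toy_policy`, `toy_expert`, the `"consult"`/`"answer"` action labels, and the exact-match reward are all assumptions made for illustration, and no policy update (the actual RL step) is shown.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (role, text) pairs; hypothetical format


def verifiable_reward(answer: str, reference: str) -> float:
    """Outcome-based reward: 1.0 on exact match with the reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def rollout(
    question: str,
    reference: str,
    policy_step: Callable[[Transcript, bool], Tuple[str, str]],
    expert: Callable[[str], str],
    max_turns: int = 4,
    training: bool = True,
) -> Tuple[Transcript, float, int]:
    """Run one multi-turn episode and return (transcript, reward, consult count).

    The expert is only reachable when `training` is True (assumption: this
    mirrors "assistance during training, independent solving at evaluation").
    """
    transcript: Transcript = [("question", question)]
    consults = 0
    for _ in range(max_turns):
        action, text = policy_step(transcript, training)
        if action == "consult" and training:
            consults += 1
            transcript.append(("policy_query", text))
            transcript.append(("expert_reply", expert(text)))
        else:
            # Any other action is treated as the final answer.
            transcript.append(("answer", text))
            return transcript, verifiable_reward(text, reference), consults
    return transcript, 0.0, consults  # ran out of turns without answering


def toy_policy(transcript: Transcript, allow_expert: bool) -> Tuple[str, str]:
    """Hypothetical stand-in policy: ask once if allowed, then repeat the hint."""
    hints = [text for role, text in transcript if role == "expert_reply"]
    if hints:
        return "answer", hints[-1]
    if allow_expert:
        return "consult", transcript[0][1]
    return "answer", "unknown"


def toy_expert(query: str) -> str:
    """Hypothetical stand-in expert that knows a single fact."""
    return "4" if query == "2 + 2" else "unknown"
```

In a real system the policy would be an LLM whose parameters are updated from the collected rewards, so that it no longer needs the expert at evaluation time; the fixed toy policy here cannot show that internalization and simply fails without help.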

Related papers