Fugu-MT Paper Translation (Abstract): Learning Agentic Policy from Action Guidance

Paper overview: Learning Agentic Policy from Action Guidance

arxiv url: http://arxiv.org/abs/2605.12004v1
Date: Tue, 12 May 2026 11:54:23 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.833991
Title: Learning Agentic Policy from Action Guidance
Authors: Yuxiang Ji, Zengbin Wang, Yong Wang, Shidong Yang, Ziyu Ma, Guanhua Chen, Zonghua Sun, Liaoni Wu, Xiangxiang Chu
Abstract summary: We propose ActGuide-RL, which injects action data as plan-style reference guidance. Guided and unguided rollouts are jointly optimized via mixed-policy training. On search-agent benchmarks, ActGuide-RL substantially improves over zero RL.
Reference score (independently computed attention score): 21.951262624996982
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Agentic reinforcement learning (RL) for Large Language Models (LLMs) critically depends on the exploration capability of the base policy, as training signals emerge only within its in-capability region. For tasks where the base policy cannot reach reward states, additional training or external guidance is needed to recover effective learning signals. Rather than relying on costly iterative supervised fine-tuning (SFT), we exploit the abundant action data generated in everyday human interactions. We propose ActGuide-RL, which injects action data as plan-style reference guidance, enabling the agentic policy to overcome reachability barriers to reward states. Guided and unguided rollouts are then jointly optimized via mixed-policy training, internalizing the exploration gains back into the unguided policy. Motivated by a theoretical and empirical analysis of the benefit-risk trade-off, we adopt a minimal intervention principle that invokes guidance only as an adaptive fallback, matching task difficulty while minimizing off-policy risk. On search-agent benchmarks, ActGuide-RL substantially improves over zero RL (+10.7 pp on GAIA and +19 pp on XBench with Qwen3-4B), and performs on par with the SFT+RL pipeline without any cold start. This suggests a new paradigm for agentic RL that reduces the reliance on heavy SFT data by using scalable action guidance instead.
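The "guidance only as an adaptive fallback" idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract gives no algorithmic details, so the names `Rollout`, `toy_policy`, `collect_group`, and `group_advantages`, the half-group fallback ratio, and the group-mean baseline are all assumptions made for illustration. Any off-policy correction for guided rollouts is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rollout:
    reward: float  # 1.0 if the rollout reached a reward state, else 0.0
    guided: bool   # True if plan-style guidance was injected into the prompt


# Hypothetical policy signature: (task, optional guidance text) -> Rollout.
Policy = Callable[[str, Optional[str]], Rollout]


def toy_policy(task: str, guidance: Optional[str]) -> Rollout:
    """Stand-in for an LLM agent: solves 'easy' tasks alone, others only with guidance."""
    solved = task == "easy" or guidance is not None
    return Rollout(reward=1.0 if solved else 0.0, guided=guidance is not None)


def collect_group(task: str, policy: Policy, guidance: str,
                  group_size: int = 4) -> List[Rollout]:
    """Sample unguided rollouts first; use guidance only if none of them earns reward."""
    unguided = [policy(task, None) for _ in range(group_size)]
    if any(r.reward > 0 for r in unguided):
        return unguided  # task is within reach: no intervention needed
    # Fallback (assumed ratio): replace half of the group with guided rollouts.
    n_guided = group_size // 2
    guided = [policy(task, guidance) for _ in range(n_guided)]
    return unguided[: group_size - n_guided] + guided


def group_advantages(group: List[Rollout]) -> List[float]:
    """Group-mean baseline over the mixed group (an assumed, GRPO-like choice)."""
    mean = sum(r.reward for r in group) / len(group)
    return [r.reward - mean for r in group]


if __name__ == "__main__":
    plan = "plan: search, then answer"  # hypothetical guidance text
    for task in ("easy", "hard"):
        group = collect_group(task, toy_policy, plan)
        print(task, [r.guided for r in group], group_advantages(group))
```

In this sketch an all-zero unguided group would carry no learning signal, since every advantage would be zero. Mixing in guided rollouts restores a nonzero signal while still keeping unguided samples in the same group.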
