Fugu-MT Paper Translation (Abstract): Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

Paper abstract: Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

arxiv url: http://arxiv.org/abs/2510.14967v1
Date: Thu, 16 Oct 2025 17:59:32 GMT
Status: Translation complete
Last updated in system: 2025-10-17 21:15:15.001732
Title: Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents
Title (reference translation): Information gain-based policy optimization: a simple and effective approach for multi-turn LLM agents
Authors: Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying,
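Abstract summary: Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to improve their ability to interact with external environments. Existing approaches typically rely on outcome-based rewards that are provided only at the final answer. This paper proposes Information Gain-based Policy Optimization (IGPO).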
Reference score (independently computed attention score): 28.145430029174577
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.
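Abstract (reference translation): Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to improve their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are provided only at the final answer. This reward sparsity is especially problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signal, and (ii) a lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. This paper proposes Information Gain-based Policy Optimization (IGPO). IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines the turn-level reward as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks show that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.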
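
The turn-level reward described in the abstract (the marginal increase in the policy's probability of producing the correct answer) and the advantage-collapse problem can be illustrated with a small sketch. This is not the authors' implementation. The function names, the choice to add the outcome reward to the final turn, and the group-wise mean/std normalization are all assumptions made for illustration; the abstract does not specify how the rewards are combined or normalized.

```python
from statistics import mean, pstdev


def information_gain_rewards(answer_probs):
    """Turn-level rewards as the marginal increase in P(correct answer).

    answer_probs[0] is the policy's probability of the ground-truth answer
    before any interaction turn; answer_probs[t] is that probability after
    turn t. How this probability is computed is outside this sketch.
    """
    return [
        answer_probs[t] - answer_probs[t - 1]
        for t in range(1, len(answer_probs))
    ]


def dense_rewards(answer_probs, outcome_reward):
    """Combine turn-level gains with an outcome-level reward.

    Assumption: the outcome reward is simply added to the last turn.
    The abstract only says the two signals are combined.
    """
    rewards = information_gain_rewards(answer_probs)
    rewards[-1] += outcome_reward
    return rewards


def group_normalized_advantages(rewards):
    """Normalize rewards across a group of rollouts (hypothetical scheme).

    If every rollout has the same reward, the standard deviation is zero
    and all advantages are zero: the advantage collapse from the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Two rollouts that both end with a wrong answer (outcome reward 0.0).
rollout_a = [0.25, 0.5, 0.5]     # first turn retrieved useful evidence
rollout_b = [0.25, 0.25, 0.125]  # no gain, then a misleading turn

# Outcome-only rewards are identical, so the learning signal vanishes.
sparse_advantages = group_normalized_advantages([0.0, 0.0])

# Turn-level rewards still separate the two rollouts at the first turn.
first_turn_rewards = [
    dense_rewards(rollout_a, 0.0)[0],
    dense_rewards(rollout_b, 0.0)[0],
]
dense_advantages = group_normalized_advantages(first_turn_rewards)
```

In this toy group, the outcome-only advantages are all zero, while the first-turn advantages differ in sign between the two rollouts, so the turn that raised the probability of the correct answer is still rewarded even though both final answers were wrong.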


Related papers