Fugu-MT paper translation (abstract): InfoFlow: Reinforcing Search Agent Via Reward Density Optimization

Paper abstract: InfoFlow: Reinforcing Search Agent Via Reward Density Optimization

arxiv url: http://arxiv.org/abs/2510.26575v1
Date: Thu, 30 Oct 2025 15:03:21 GMT
Status: Translation complete
System update date: 2025-10-31 16:05:09.871244
Title: InfoFlow: Reinforcing Search Agent Via Reward Density Optimization
Title (reference translation): InfoFlow: Reinforcing Search Agent Via Reward Density Optimization
Authors: Kun Luo, Hongjin Qian, Zheng Liu, Ziyi Xia, Shitao Xiao, Siqi Bao, Jun Zhao, Kang Liu
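Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for enhancing agentic deep search. The paper formalizes the challenge of low reward density as the Reward Density Optimization problem, which aims to improve the reward obtained per unit of exploration cost. It introduces InfoFlow, a systematic framework that tackles this problem from three aspects.
Reference score (independently computed attention score): 37.266452141225415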
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for enhancing agentic deep search. However, its application is often hindered by low Reward Density in deep search scenarios, where agents expend significant exploratory costs for infrequent and often null final rewards. In this paper, we formalize this challenge as the Reward Density Optimization problem, which aims to improve the reward obtained per unit of exploration cost. We introduce InfoFlow, a systematic framework that tackles this problem from three aspects. 1) Subproblem decomposition: breaking down long-range tasks to assign process rewards, thereby providing denser learning signals. 2) Failure-guided hints: injecting corrective guidance into stalled trajectories to increase the probability of successful outcomes. 3) Dual-agent refinement: employing a dual-agent architecture to offload the cognitive burden of deep exploration. A refiner agent synthesizes the search history, which effectively compresses the researcher's perceived trajectory, thereby reducing exploration cost and increasing the overall reward density. We evaluate InfoFlow on multiple agentic search benchmarks, where it significantly outperforms strong baselines, enabling lightweight LLMs to achieve performance comparable to advanced proprietary LLMs.
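Abstract (reference translation): Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for strengthening agentic deep search. In deep search scenarios, however, its application is often hampered by low Reward Density: agents spend large exploration costs on final rewards that are infrequent and often null. The paper formalizes this challenge as the Reward Density Optimization problem, whose goal is to improve the reward obtained per unit of exploration cost. It introduces InfoFlow, a systematic framework that addresses the problem from three aspects. 1) Subproblem decomposition: long-range tasks are broken down so that process rewards can be assigned, giving denser learning signals. 2) Failure-guided hints: corrective guidance is injected into stalled trajectories to raise the probability of success. 3) Dual-agent refinement: a dual-agent architecture relieves the cognitive burden of deep exploration. A refiner agent synthesizes the search history, which effectively compresses the researcher's perceived trajectory, lowering exploration cost and raising overall reward density. InfoFlow is evaluated on multiple agentic search benchmarks, where it significantly outperforms strong baselines and allows lightweight LLMs to reach performance comparable to advanced proprietary LLMs.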
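
The abstract describes reward density only informally, as the reward obtained per unit of exploration cost. The sketch below is one possible toy reading of that phrase, not the paper's definition: it assumes density is total reward divided by total cost, that cost can be proxied by the number of tokens the agent perceives, and that process rewards simply add to the final reward. The `Trajectory` class, its field names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One search rollout. All fields are illustrative assumptions."""
    final_reward: float           # verifiable outcome reward, e.g. 0.0 or 1.0
    process_rewards: list[float]  # per-subproblem rewards; empty if none assigned
    cost_tokens: int              # exploration cost, proxied by tokens perceived


def reward_density(trajectories: list[Trajectory]) -> float:
    """Total reward divided by total exploration cost (assumed definition)."""
    total_reward = sum(
        t.final_reward + sum(t.process_rewards) for t in trajectories
    )
    total_cost = sum(t.cost_tokens for t in trajectories)
    if total_cost <= 0:
        raise ValueError("exploration cost must be positive")
    return total_reward / total_cost


# Sparse setting: only a final reward, over a long raw search history.
sparse = [Trajectory(0.0, [], 8000), Trajectory(1.0, [], 8000)]

# Same outcomes, plus process rewards for solved subproblems.
decomposed = [
    Trajectory(0.0, [0.25, 0.25], 8000),
    Trajectory(1.0, [0.25, 0.25, 0.25], 8000),
]

# Same rewards, but a compressed history halves the perceived cost.
refined = [
    Trajectory(t.final_reward, t.process_rewards, t.cost_tokens // 2)
    for t in decomposed
]

for name, batch in [("sparse", sparse), ("decomposed", decomposed), ("refined", refined)]:
    print(f"{name}: {reward_density(batch):.3e} reward per token")
```

Under these assumptions, adding process rewards raises the numerator and compressing the perceived trajectory lowers the denominator, so both increase the ratio. Failure-guided hints would act on the numerator as well, by turning some zero-reward trajectories into successful ones.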
