Fugu-MT paper translation (abstract): ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

Paper summary: ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

arxiv url: http://arxiv.org/abs/2604.20486v1
Date: Wed, 22 Apr 2026 12:20:46 GMT
Status: translation complete
Last updated in system: 2026-04-23 15:36:11.120215
Title: ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards
Title (reference translation): ProMMSearchAgent: a general-purpose multimodal search agent trained with process-oriented rewards
Authors: Wentao Yan, Shengqin Wang, Huichi Zhou, Yihang Chen, Kun Shao, Yuan Xie, Zhizhong Zhang
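Abstract summary: ProMMSearchAgent establishes a novel Sim-to-Real training paradigm for multimodal search. The authors generate dense behavioral metadata that explicitly rewards the correct cognitive decision. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.
Reference score (independently computed attention score): 24.61813749877376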
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search. We decouple policy learning into a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision, initiating a multimodal or text search only when visually or factually uncertain. Extensive experiments demonstrate that our locally-trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.
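Abstract (reference translation): Training multimodal agents with reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent and build a new Sim-to-Real training paradigm for multimodal search. Policy learning is decoupled into a deterministic, local, static sandbox. Importantly, to learn effectively in this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the appropriate cognitive decision: starting a multimodal or text search only when the agent is visually or factually uncertain. Extensive experiments show that the locally trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.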

Related papers