Fugu-MT Paper Translation (Abstract): Repurposing Synthetic Data for Fine-grained Search Agent Supervision

Paper overview: Repurposing Synthetic Data for Fine-grained Search Agent Supervision

arxiv url: http://arxiv.org/abs/2510.24694v1
Date: Tue, 28 Oct 2025 17:50:40 GMT
Status: translation complete
Last updated in system: 2025-10-29 15:35:37.321028
Title: Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Authors: Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang
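Abstract summary: LLM-based search agents are increasingly trained on entity-centric synthetic data. Prevailing training methods discard this rich entity information and rely instead on sparse, outcome-based rewards. E-GRPO (Entity-aware Group Relative Policy Optimization) is a novel framework that formulates a dense entity-aware reward function.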
Reference score (attention score computed independently by this site): 81.95597592711688
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
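The reward idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the case-insensitive substring matching, the scaling factor `alpha`, and the toy rollouts are all assumptions made for illustration, and the paper's exact formulation (for example, how the match rate is normalized) may differ. The sketch gives a correct rollout a reward of 1, gives an incorrect rollout a partial reward proportional to its entity match rate, and then computes group-relative advantages by standardizing rewards within the group, as GRPO is commonly described.

```python
from statistics import mean, pstdev


def entity_match_rate(trajectory_text, gold_entities):
    """Fraction of ground-truth entities mentioned in the rollout text.

    Matching is a case-insensitive substring test (an assumption)."""
    if not gold_entities:
        return 0.0
    text = trajectory_text.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return hits / len(gold_entities)


def entity_aware_reward(is_correct, match_rate, alpha=0.3):
    """1.0 for a correct answer, otherwise a partial reward alpha * match_rate.

    alpha is a hypothetical hyperparameter that keeps partial rewards
    below the reward for a correct answer."""
    return 1.0 if is_correct else alpha * match_rate


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of three rollouts for one question (fabricated for illustration).
gold_entities = ["Marie Curie", "Sorbonne", "Polonium"]
rollouts = [
    ("Marie Curie studied at the Sorbonne and discovered Polonium.", True),
    ("Marie Curie worked at the Sorbonne.", False),   # near-miss
    ("The page found was unrelated to the question.", False),  # failure
]

rewards = [
    entity_aware_reward(correct, entity_match_rate(text, gold_entities))
    for text, correct in rollouts
]
advantages = group_relative_advantages(rewards)
```

With an outcome-only reward, the second and third rollouts would both receive 0 and be indistinguishable. Here the near-miss receives a larger reward, and therefore a larger advantage, than the complete failure.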

Related papers