Fugu-MT Paper Translation (Abstract): Evaluating the Search Agent in a Parallel World

Paper overview: Evaluating the Search Agent in a Parallel World

arxiv url: http://arxiv.org/abs/2603.04751v1
Date: Thu, 05 Mar 2026 02:56:42 GMT
Status: Translation complete
System update date: 2026-03-06 22:06:11.045933
Title: Evaluating the Search Agent in a Parallel World
Title (reference translation): Evaluating Search Agents in a Parallel World
Authors: Jiawei Chen, Xintian Shen, Lihao Zheng, Lifu Mu, Haoyi Sun, Ning Mao, Hao Ma, Tao Wei, Pan Zhou, Kun Zhan
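Abstract summary: We introduce Mind-ParaWorld, a framework for evaluating search agents in a parallel world. The ParaWorld Law Model constructs a set of indivisible Atomic Facts and a unique ground truth for each question. During evaluation, instead of retrieving real-world results, the agent interacts with a ParaWorld Engine Model. MPW-Bench is an interactive benchmark spanning 19 domains with 1,608 instances.
Reference score (independently computed attention score): 28.24678964635285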
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Integrating web search tools has significantly extended the capability of LLMs to address open-world, real-time, and long-tail problems. However, evaluating these Search Agents presents formidable challenges. First, constructing high-quality deep search benchmarks is prohibitively expensive, while unverified synthetic data often suffers from unreliable sources. Second, static benchmarks face dynamic obsolescence: as internet information evolves, complex queries requiring deep research often degrade into simple retrieval tasks due to increased popularity, and ground truths become outdated due to temporal shifts. Third, attribution ambiguity confounds evaluation, as an agent's performance is often dominated by its parametric memory rather than its actual search and reasoning capabilities. Finally, reliance on specific commercial search engines introduces variability that hampers reproducibility. To address these issues, we propose a novel framework, Mind-ParaWorld, for evaluating Search Agents in a Parallel World. Specifically, MPW samples real-world entity names to synthesize future scenarios and questions situated beyond the model's knowledge cutoff. A ParaWorld Law Model then constructs a set of indivisible Atomic Facts and a unique ground truth for each question. During evaluation, instead of retrieving real-world results, the agent interacts with a ParaWorld Engine Model that dynamically generates SERPs grounded in these inviolable Atomic Facts. We release MPW-Bench, an interactive benchmark spanning 19 domains with 1,608 instances. Experiments across three evaluation settings show that, while search agents are strong at evidence synthesis given complete information, their performance is limited not only by evidence collection and coverage in unfamiliar search environments, but also by unreliable evidence sufficiency judgment and when-to-stop decisions, which act as bottlenecks.
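Abstract (reference translation): Integrating web search tools has greatly extended the ability of LLMs to handle open-world, real-time, and long-tail problems. Evaluating these search agents, however, is very difficult. First, building high-quality deep search benchmarks is extremely expensive, while unverified synthetic data often suffers from unreliable sources. Second, static benchmarks face dynamic obsolescence: as information on the internet evolves, complex queries that once required deep research degrade into simple retrieval tasks as their topics grow more popular, and ground truths become outdated through temporal shifts. Third, attribution ambiguity confounds evaluation, because an agent's performance is often dominated by its parametric memory rather than by its actual search and reasoning abilities. Finally, reliance on specific commercial search engines introduces variability that harms reproducibility. To address these issues, we propose Mind-ParaWorld, a new framework for evaluating search agents in a parallel world. Specifically, MPW samples real-world entity names and synthesizes future scenarios and questions that lie beyond the model's knowledge cutoff. The ParaWorld Law Model then constructs a set of indivisible Atomic Facts and a unique ground truth for each question. During evaluation, instead of retrieving real-world results, the agent interacts with the ParaWorld Engine Model, which dynamically generates SERPs grounded in these inviolable Atomic Facts. MPW-Bench is an interactive benchmark spanning 19 domains with 1,608 instances. Experiments in three evaluation settings show that search agents are strong at evidence synthesis when given complete information, but their performance is limited not only by evidence collection and coverage in unfamiliar search environments, but also by unreliable judgments of evidence sufficiency and of when to stop.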

Related papers