Fugu-MT Paper Translation (Abstract): LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

Paper overview: LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

arxiv url: http://arxiv.org/abs/2605.28721v1
Date: Wed, 27 May 2026 16:39:57 GMT
Status: Translation complete
Last updated in system: 2026-05-28 17:38:56.222384
Title: LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
Title (reference translation): LiveBrowseComp: Are Search Agents Searching, or Verifying What They Already Know?
Authors: HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang, Zhuoyao Wang, Ming Liu, Bing Qin, XingYu
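Abstract summary: LLM-based search agents often rely on intrinsic knowledge rather than external evidence. LiveBrowseComp is a benchmark designed to evaluate agents beyond intrinsic coverage. All evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance.
Reference score (independently computed attention measure): 32.434901767447165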
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge -- information encoded in the model before retrieval -- rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.
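Abstract (reference translation): Are LLM-based search agents genuinely searching, or using the web to verify what they already know? This paper examines the question on BrowseComp using three diagnostics. Even with tool access, agents often rely on intrinsic knowledge (information encoded in the model before retrieval) rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery. The paper then introduces LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.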
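
The abstract states that each question depends on facts published within the 90 days preceding benchmark construction. A minimal sketch of such a recency check follows. It is only an illustration: the construction date, the `within_window` helper, the inclusive boundaries, and the candidate records are all assumptions, not the authors' pipeline, which this page does not describe.

```python
from datetime import date, timedelta

# The 90-day window comes from the abstract; everything else is hypothetical.
WINDOW_DAYS = 90


def within_window(published: date, construction: date,
                  window_days: int = WINDOW_DAYS) -> bool:
    """Return True if `published` falls in the `window_days` before `construction`.

    Both boundaries are treated as inclusive, which is an assumption.
    """
    earliest = construction - timedelta(days=window_days)
    return earliest <= published <= construction


# Hypothetical construction date and candidate facts, for illustration only.
construction_date = date(2026, 5, 1)
candidates = [
    {"id": "fact-a", "published": date(2026, 4, 1)},   # inside the window
    {"id": "fact-b", "published": date(2026, 1, 1)},   # too old
    {"id": "fact-c", "published": date(2026, 5, 2)},   # after construction
]

# Keep only facts whose publication date lies inside the window.
recent_ids = [c["id"] for c in candidates
              if within_window(c["published"], construction_date)]
print(recent_ids)
```

A date check like this covers recency only; the filtering of globally salient events mentioned in the abstract would need a separate step.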

Related papers