Fugu-MT Paper Translation (Abstract): Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Paper overview: Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

arxiv url: http://arxiv.org/abs/2603.12180v1
Date: Thu, 12 Mar 2026 17:11:22 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:26.244229
Title: Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Authors: Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta
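Abstract summary: We introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. The best agents can match human searchers in raw accuracy, but they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
Reference score (site's own attention estimate): 37.38277822936901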
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
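The abstract says the benchmark was designed under Classical Test Theory to maximize discriminative power, but gives no procedure. As a rough illustration only, the two standard CTT item statistics (difficulty as the proportion correct, and discrimination as a corrected item-total correlation) can be sketched as follows. The function names and the 0/1 response matrix are hypothetical and are not taken from the paper or its evaluation harness.

```python
from statistics import mean, pstdev


def item_difficulty(responses, item):
    """CTT difficulty: proportion of systems that answered the item correctly."""
    return mean(row[item] for row in responses)


def item_discrimination(responses, item):
    """Corrected item-total correlation: Pearson r between the item score
    and each system's total score on the remaining items."""
    item_scores = [row[item] for row in responses]
    rest_totals = [sum(row) - row[item] for row in responses]
    sx, sy = pstdev(item_scores), pstdev(rest_totals)
    if sx == 0 or sy == 0:
        return 0.0  # no variance, so the item cannot separate systems
    mx, my = mean(item_scores), mean(rest_totals)
    cov = mean((x - mx) * (y - my) for x, y in zip(item_scores, rest_totals))
    return cov / (sx * sy)


# Hypothetical 0/1 results: rows are systems (agents or humans),
# columns are benchmark questions. Not real MADQA data.
responses = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]

for q in range(len(responses[0])):
    print(q, item_difficulty(responses, q), round(item_discrimination(responses, q), 3))
```

In this toy matrix the first question is answered correctly by every system, so it has zero discrimination. A benchmark aiming for discriminative power would, under this reading, favor questions whose correctness tracks overall ability.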
