Fugu-MT Paper Translation (Abstract): MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents

Paper overview: MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents

arxiv url: http://arxiv.org/abs/2603.05697v1
Date: Thu, 05 Mar 2026 21:43:02 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:44.582111
Title: MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents
Title (reference translation): MultiHaystack: A Benchmark for Multimodal Retrieval and Reasoning over More Than 40K Images, Videos, and Documents
Authors: Dannong Xu, Zhongyu Yang, Jun Chen, Yingfang Yuan, Ming Hu, Lei Sun, Luc Van Gool, Danda Pani Paudel, Chun-Mei Feng
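Abstract summary: MultiHaystack is the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. Models perform competitively when given the corresponding evidence, but their performance drops sharply when they must retrieve that evidence from the full corpus.
Reference score (attention level computed independently by this site): 57.32877731797049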
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) achieve strong performance on benchmarks that evaluate text, image, or video understanding separately. However, these settings do not assess a critical real-world requirement, which involves retrieving relevant evidence from large, heterogeneous multimodal corpora prior to reasoning. Most existing benchmarks restrict retrieval to small, single-modality candidate sets, substantially simplifying the search space and overstating end-to-end reliability. To address this gap, we introduce MultiHaystack, the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. MultiHaystack comprises over 46,000 multimodal retrieval candidates across documents, images, and videos, along with 747 open yet verifiable questions. Each question is grounded in a unique validated evidence item within the retrieval pool, requiring evidence localization across modalities and fine-grained reasoning. In our study, we find that models perform competitively when provided with the corresponding evidence, but their performance drops sharply when required to retrieve that evidence from the full corpus. Additionally, even the strongest retriever, E5-V, achieves only 40.8% Recall@1, while state-of-the-art MLLMs such as GPT-5 experience a significant drop in reasoning accuracy from 80.86% when provided with the corresponding evidence to 51.4% under top-5 retrieval. These results indicate that multimodal retrieval over heterogeneous pools remains a primary bottleneck for MLLMs, positioning MultiHaystack as a valuable testbed that highlights underlying limitations obscured by small-scale evaluations and promotes retrieval-centric advances in multimodal systems.
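Abstract (reference translation): Multimodal large language models (MLLMs) achieve strong performance on benchmarks that evaluate text, image, or video understanding separately. However, these settings do not assess an important real-world requirement: retrieving relevant evidence from large, heterogeneous multimodal corpora before reasoning. Most existing benchmarks restrict retrieval to small, single-modality candidate sets, which greatly simplifies the search space and overstates end-to-end reliability. To address this gap, we introduce MultiHaystack, the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. MultiHaystack consists of more than 46,000 multimodal retrieval candidates spanning documents, images, and videos, together with 747 open yet verifiable questions. Each question is grounded in a unique, validated evidence item in the retrieval pool, and requires evidence localization across modalities and fine-grained reasoning. In this study, we find that models perform competitively when given the corresponding evidence, but that their performance drops sharply when they must retrieve that evidence from the full corpus. Moreover, even the strongest retriever, E5-V, achieves only 40.8% Recall@1, and state-of-the-art MLLMs such as GPT-5 see reasoning accuracy fall substantially, from 80.86% when given the corresponding evidence to 51.4% under top-5 retrieval. These results position MultiHaystack as a valuable testbed that highlights underlying limitations obscured by small-scale evaluations and promotes retrieval-centric advances in multimodal systems.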
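
For readers unfamiliar with the retrieval metric quoted in the abstract, the following is a minimal sketch of how Recall@k can be computed when each question has exactly one gold evidence item, as the abstract describes. The function name `recall_at_k`, the data layout, and all IDs are hypothetical illustrations; this is not the authors' evaluation code.

```python
from typing import Dict, Sequence


def recall_at_k(ranked: Dict[str, Sequence[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of questions whose single gold evidence item appears in the top-k results."""
    if k < 1:
        raise ValueError("k must be >= 1")
    if not gold:
        raise ValueError("gold must contain at least one question")
    hits = sum(
        1
        for qid, evidence_id in gold.items()
        # A question with no ranked list counts as a miss.
        if evidence_id in list(ranked.get(qid, ()))[:k]
    )
    return hits / len(gold)


# Toy data with made-up IDs (assumption: one gold item per question).
gold = {"q1": "img_07", "q2": "doc_12", "q3": "vid_03", "q4": "doc_40"}
ranked = {
    "q1": ["img_07", "doc_01", "vid_09"],
    "q2": ["img_02", "doc_12", "doc_33"],
    "q3": ["doc_05", "img_11", "vid_08"],
    "q4": ["doc_40", "vid_03", "img_07"],
}

print(recall_at_k(ranked, gold, 1))  # 2 of 4 questions hit at rank 1
print(recall_at_k(ranked, gold, 3))  # 3 of 4 questions hit within the top 3
```

Under this definition, a Recall@1 of 40.8% means that the gold evidence item is ranked first for roughly two in five questions.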

Related papers