Fugu-MT Paper Translation (Summary): LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos

Paper summary: LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos

arxiv url: http://arxiv.org/abs/2603.14468v1
Date: Sun, 15 Mar 2026 16:20:23 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.825817
Title: LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos
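Title (reference translation): LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos
Authors: Rongyi Yu, Chenyuan Duan, Wentao Zhang
Abstract summary: LongVidSearch is a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos. A Hop-k question requires exactly k evidence clips.
Reference score (independently computed attention score): 7.139631028105273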
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long video question answering (Long-Video QA) increasingly relies on agentic tool use to retrieve evidence from long videos. In realistic settings, this process often requires multi-hop retrieval, where agents must iteratively gather multiple discontinuous evidence clips. However, existing long-video benchmarks are largely static: they rarely enforce strict multi-hop retrieval and typically lack a standardized evidence-access interface, making it difficult to separate failures in retrieval planning from those in answer generation. To address this gap, we introduce LongVidSearch, a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos under standardized access constraints. LongVidSearch enforces retrieval necessity: a Hop-k question requires exactly k necessary evidence clips, and removing any single clip renders the question unsolvable. The benchmark contains 3,000 questions over 447 long videos (average length 26 minutes), covering four reasoning categories: State Mutation, Causal Inference, Global Summary, and Visual Tracking, with 2-hop, 3-hop, and 4-hop evidence requirements. To ensure fair and controlled evaluation, all agents interact with LongVidSearch through a unified tool interface, which fixes the retrieval backend and isolates the agent's ability to formulate queries and plan iterative retrieval. In addition to answer accuracy, we measure tool-call cost to analyze the accuracy-efficiency trade-off under identical access conditions. We evaluate VideoAgent-style QA agents with multiple backbone LLMs using three-judge majority voting. GPT-5 achieves the highest accuracy (42.43), outperforming Gemini 3 Pro (30.97) and GPT-4o (19.20), yet remaining below 50 %, highlighting the difficulty of multi-hop retrieval planning. With gold evidence clips, performance becomes near-perfect, confirming retrieval planning as the primary bottleneck.
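Abstract (reference translation): Long video question answering (Long-Video QA) increasingly relies on agentic tool use to retrieve evidence from long videos. In realistic settings, agents must iteratively gather multiple discontinuous evidence clips. However, existing long-video benchmarks are largely static: they rarely enforce strict multi-hop retrieval and typically lack a standardized evidence-access interface, which makes it difficult to separate failures in retrieval planning from failures in answer generation. To address this gap, we introduce LongVidSearch, a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos under standardized access constraints. LongVidSearch enforces retrieval necessity: a Hop-k question requires exactly k evidence clips. The benchmark contains 3,000 questions over 447 long videos (average length 26 minutes), covering four reasoning categories (State Mutation, Causal Inference, Global Summary, and Visual Tracking) with 2-hop, 3-hop, and 4-hop evidence requirements. To ensure fair and controlled evaluation, all agents interact with LongVidSearch through a unified tool interface, which holds the retrieval backend fixed and isolates the agent's ability to formulate queries and plan iterative retrieval. In addition to answer accuracy, we measure tool-call cost to analyze the accuracy-efficiency trade-off under identical access conditions. We evaluate VideoAgent-style QA agents with multiple backbone LLMs using majority voting. GPT-5 achieves the highest accuracy (42.43), outperforming Gemini 3 Pro (30.97) and GPT-4o (19.20), yet it remains below 50%, highlighting the difficulty of multi-hop retrieval planning. With gold evidence clips, performance becomes near-perfect, confirming retrieval planning as the primary bottleneck.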
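
The abstract reports two quantities per agent: answer accuracy decided by three-judge majority voting, and tool-call cost. The following is a minimal Python sketch of how such a summary could be computed, not the authors' evaluation code. The record layout (`EpisodeResult`), the function names, and the choice of mean tool calls per question as the cost measure are all assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """One agent run on one question (hypothetical record layout)."""
    judge_verdicts: tuple  # one boolean per judge, e.g. (True, True, False)
    tool_calls: int        # number of retrieval tool calls the agent issued


def majority_correct(verdicts):
    """Return True when more than half of the judges mark the answer correct."""
    return sum(verdicts) * 2 > len(verdicts)


def summarize(results):
    """Return (accuracy in percent, mean tool calls per question).

    Mean tool calls is an assumed cost measure, chosen only for illustration.
    """
    if not results:
        raise ValueError("no results to summarize")
    n_correct = sum(majority_correct(r.judge_verdicts) for r in results)
    accuracy = 100.0 * n_correct / len(results)
    mean_calls = sum(r.tool_calls for r in results) / len(results)
    return accuracy, mean_calls
```

Reporting both numbers side by side is what makes an accuracy-efficiency comparison possible: two agents with the same accuracy can still differ in how many retrieval calls they spend to reach it.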
