Fugu-MT Paper Translation (Abstract): TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

Paper summary: TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

arxiv url: http://arxiv.org/abs/2605.07593v1
Date: Fri, 08 May 2026 11:06:43 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:39.010352
Title: TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
Authors: Hengyi Feng, Hao Liang, Mingrui Chen, Bohan Zeng, Meiyi Qiang, Zhengyang Zhao, Zimo Meng, Zeang Sheng, Wentao Zhang
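Abstract summary: Real-world audio-visual understanding requires chaining evidence that is sparse, temporally dispersed, and split across the visual and auditory streams. TraceAV-Bench is the first benchmark to jointly evaluate multi-hop reasoning over long audio-visual trajectories and multimodal hallucination robustness.
Reference score (independently computed attention metric): 13.9567665031159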
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Real-world audio-visual understanding requires chaining evidence that is sparse, temporally dispersed, and split across the visual and auditory streams, whereas existing benchmarks largely fail to evaluate this capability. They restrict videos to short clips, isolate modalities, or reduce questions to one-hop perception. We introduce TraceAV-Bench, the first benchmark to jointly evaluate multi-hop reasoning over long audio-visual trajectories and multimodal hallucination robustness. TraceAV-Bench comprises 2,200 rigorously validated multiple-choice questions over 578 long videos, totaling 339.5 hours, spanning 4 evaluation dimensions and 15 sub-tasks. Each question is grounded in an explicit reasoning chain that averages 3.68 hops across a 15.1-minute temporal span. The dataset is built by a three-step semi-automated pipeline followed by a strict quality assurance process. Evaluation of multiple representative OmniLLMs on TraceAV-Bench reveals that the benchmark poses a persistent challenge across all models, with the strongest closed-source model (Gemini 3.1 Pro) reaching only 68.29% on general tasks, and the best open-source model (Ming-Flash-Omni-2.0) reaching 51.70%, leaving substantial headroom. Moreover, we find that robustness to multimodal hallucination is largely decoupled from general multimodal reasoning performance. We anticipate that TraceAV-Bench will stimulate further research toward OmniLLMs that can reason coherently and faithfully over long-form audio-visual content.
