Fugu-MT Paper Translation (Abstract): SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding

Paper abstract: SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding

arxiv url: http://arxiv.org/abs/2605.08412v1
Date: Fri, 08 May 2026 19:20:23 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.625221
Title: SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding
Title (reference translation): SYNCR: A Cross-Video Reasoning Benchmark Using Synthetic Grounding
Authors: Sara Ghazanfari, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami
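Abstract summary: Multimodal Large Language Models (MLLMs) have made rapid progress in single-video understanding, but their ability to reason across multiple independent video streams remains poorly understood. SYNCR is a controlled synthetic benchmark for cross-video reasoning with programmatically verified grounding. Zero-shot evaluation of open- and closed-weight MLLMs reveals a substantial gap between current models and humans.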
Reference score (independently calculated attention score): 20.256916516259782
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have made rapid progress in single-video understanding, yet their ability to reason across multiple independent video streams remains poorly understood. Existing multi-video benchmarks rely largely on human-annotated real-world footage, limiting the precision of spatial, temporal, and physical ground truth and making it difficult to diagnose model failures. We introduce SYNCR, a controlled synthetic benchmark for cross-video reasoning with programmatically verified grounding. Built using Habitat, Kubric, and CLEVRER simulator engines, SYNCR contains 8,163 multi-video question-answer pairs grounded in 9,650 unique videos. It evaluates MLLMs across eight tasks spanning four diagnostic pillars: Temporal Alignment, Spatial Tracking, Comparative Reasoning, and Holistic Synthesis. Our zero-shot evaluation of leading open- and closed-weight MLLMs reveals a substantial gap between current models and humans: the best model achieves only 52.5% average accuracy, compared to an 89.5% human baseline. Models perform relatively well on temporal ordering but struggle with precise physical and spatial reasoning, with the best model reaching only 26.0% accuracy on Kinematic Comparison. We further find that parameter scaling and reasoning-specialized post-training improve temporal alignment capabilities, but do not reliably address fine-grained physical tracking or global spatial synthesis. Finally, an exploratory sim-to-real correlation analysis suggests that several SYNCR tasks track model-level trends on real-world multi-video benchmarks, while also exposing reasoning capabilities underrepresented by existing evaluations. Code available at https://github.com/SaraGhazanfari/SYNCR.
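Abstract (reference translation): Multimodal Large Language Models (MLLMs) have advanced rapidly in single-video understanding, but their ability to reason across multiple independent video streams is still not well understood. Existing multi-video benchmarks depend heavily on human-annotated real-world footage, which limits the precision of spatial, temporal, and physical ground truth and makes model failures hard to diagnose. We introduce SYNCR, a controlled synthetic benchmark for cross-video reasoning with programmatically verified grounding. SYNCR is built with the Habitat, Kubric, and CLEVRER simulator engines and contains 8,163 multi-video question-answer pairs grounded in 9,650 unique videos. It evaluates MLLMs on eight tasks spanning four diagnostic pillars: Temporal Alignment, Spatial Tracking, Comparative Reasoning, and Holistic Synthesis. Zero-shot evaluation of leading open- and closed-weight MLLMs shows a large gap between current models and humans: the best model reaches only 52.5% average accuracy against an 89.5% human baseline. Models do relatively well on temporal ordering but struggle with precise physical and spatial reasoning, with the best model reaching only 26.0% accuracy on Kinematic Comparison. Parameter scaling and reasoning-specialized post-training improve temporal alignment but do not reliably address fine-grained physical tracking or global spatial synthesis. Finally, an exploratory sim-to-real correlation analysis suggests that several SYNCR tasks track model-level trends on real-world multi-video benchmarks while also exposing reasoning capabilities that existing evaluations underrepresent. Code is available at https://github.com/SaraGhazanfari/SYNCR.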
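
The headline figures in the abstract (52.5% average accuracy for the best model against an 89.5% human baseline) can be related to per-question results with a short sketch. This is an illustration only, not the authors' evaluation code: the record format, the task keys, the toy predictions, and the choice of an unweighted mean over tasks are all assumptions. The abstract does not say whether the reported average is taken over tasks or over questions.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy in percent} from (task, predicted, gold) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        if predicted == gold:
            correct[task] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}


def average_accuracy(task_acc):
    """Unweighted mean over tasks (an assumed aggregation, see lead-in)."""
    return sum(task_acc.values()) / len(task_acc)


# Toy, made-up records: (hypothetical task key, model answer, gold answer).
records = [
    ("temporal_ordering", "A", "A"),
    ("temporal_ordering", "B", "B"),
    ("temporal_ordering", "C", "A"),
    ("kinematic_comparison", "A", "B"),
    ("kinematic_comparison", "C", "C"),
    ("kinematic_comparison", "D", "A"),
    ("kinematic_comparison", "B", "D"),
]

task_acc = per_task_accuracy(records)
avg = average_accuracy(task_acc)

# Gap between the human baseline and the best model, using the two
# percentages quoted in the abstract.
human_gap = 89.5 - 52.5

print(task_acc, avg, human_gap)
```

With the figures quoted in the abstract, the human-to-model gap comes to 37.0 percentage points. The toy records exist only to exercise the functions and say nothing about any real model.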

