Fugu-MT Paper Translation (Abstract): VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

Paper overview: VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

arxiv url: http://arxiv.org/abs/2605.01391v1
Date: Sat, 02 May 2026 11:28:20 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.745401
Title: VISTA: Video Interaction Spatio-Temporal Analysis Benchmark
Authors: Alejandro Aparcedo, Akash Kumar, Aaryan Garg, Dalton Pham, Wen-Kai Chen, Anirudh Bharadwaj, Aman Chadha, Yogesh Rawat
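Abstract summary: We introduce VISTA, a benchmark for open-set, multi-action spatio-temporal understanding in Vision-Language Models (VLMs). The benchmark integrates multiple datasets into a single interaction-aware taxonomy and comprises ~12K curated video-query pairs. Overall, VISTA is the first large-scale, interaction-aware diagnostic benchmark for spatio-temporal understanding in VLMs.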
Reference score (independently computed attention score): 47.42490151556478
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets, and restricted entity types, failing to capture the freeform, multi-action interactions between diverse entities that characterize real-world video understanding. Furthermore, the lack of a systematic framework for analyzing model failures across complementary spatio-temporal axes hinders comprehensive evaluation. To address these gaps, we introduce VISTA, a Video Interaction Spatio-Temporal Analysis benchmark designed for open-set, multi-entity, and multi-action spatio-temporal understanding in VLMs. VISTA decomposes videos into interpretable entities, their associated actions, and relational dynamics, enabling multi-axis diagnostics and unified assessment of relational, spatial, and temporal understanding. Our benchmark integrates multiple datasets into a single interaction-aware taxonomy and comprises ~12K curated video-query pairs spanning diverse scenes and complexities. We systematically evaluate 11 state-of-the-art VLMs on VISTA and break down aggregate performance across our taxonomy to reveal shortcomings and pronounced spatio-temporal biases obscured by traditional metrics. By providing detailed, taxonomy-driven diagnostics on a challenging dataset, VISTA offers a nuanced framework to guide advances in model design, pretraining strategies, and evaluation protocols. Overall, VISTA is the first large-scale, interaction-aware diagnostic benchmark for spatio-temporal understanding in VLMs.
