Fugu-MT Paper Translation (Abstract): SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

Paper overview: SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

arxiv url: http://arxiv.org/abs/2510.13016v1
Date: Tue, 14 Oct 2025 22:10:49 GMT
Status: Translation complete
Last updated in system: 2025-10-16 20:13:28.434384
Title: SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding
Title (reference translation): SVAG-Bench: A large-scale benchmark for multi-instance spatio-temporal video action grounding
Authors: Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Jindong Gu, Rajat Koner, Aljoša Ošep, Laura Leal-Taixé, Thomas Seidl
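Abstract summary: This work introduces Spatio-temporal Video Action Grounding (SVAG), a new task that requires models to simultaneously detect, track, and temporally localize all referent objects in a video. SVAG-Bench is a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs. Experiments show that existing models perform poorly on SVAG, particularly in dense or complex scenes.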
Reference score (independently calculated attention level): 48.64661382961745
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding fine-grained actions and accurately localizing their corresponding actors in space and time are fundamental capabilities for advancing next-generation AI systems, including embodied agents, autonomous platforms, and human-AI interaction frameworks. Despite recent progress in video understanding, existing methods predominantly address either coarse-grained action recognition or generic object tracking, thereby overlooking the challenge of jointly detecting and tracking multiple objects according to their actions while grounding them temporally. To address this gap, we introduce Spatio-temporal Video Action Grounding (SVAG), a novel task that requires models to simultaneously detect, track, and temporally localize all referent objects in videos based on natural language descriptions of their actions. To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs, covering a diverse range of objects, actions, and real-world scenes. We further propose SVAGFormer, a baseline framework that adapts state of the art vision language models for joint spatial and temporal grounding, and introduce SVAGEval, a standardized evaluation toolkit for fair and reproducible benchmarking. Empirical results show that existing models perform poorly on SVAG, particularly in dense or complex scenes, underscoring the need for more advanced reasoning over fine-grained object-action interactions in long videos.
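Abstract (reference translation): Understanding fine-grained actions and accurately localizing the corresponding actors in space and time is a fundamental capability for advancing next-generation AI systems, including embodied agents, autonomous platforms, and human-AI interaction frameworks. Despite recent progress in video understanding, existing methods mainly address either coarse-grained action recognition or generic object tracking, and so overlook the challenge of jointly detecting and tracking multiple objects according to their actions while grounding them temporally. To address this gap, the work introduces Spatio-temporal Video Action Grounding (SVAG), a new task that requires models to simultaneously detect, track, and temporally localize all referent objects in a video based on natural language descriptions of their actions. SVAG-Bench comprises 688 videos, 19,590 annotated records, and 903 unique verbs, covering diverse objects, actions, and real-world scenes. The authors further propose SVAGFormer, a baseline framework that adapts state-of-the-art vision-language models for joint spatial and temporal grounding, and introduce SVAGEval, a standardized evaluation toolkit for fair and reproducible benchmarking. Experiments show that existing models perform poorly on SVAG, particularly in dense or complex scenes, underscoring the need for more advanced reasoning over fine-grained object-action interactions in long videos.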


Related papers