SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding
- URL: http://arxiv.org/abs/2510.13016v2
- Date: Thu, 16 Oct 2025 15:16:51 GMT
- Title: SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding
- Authors: Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Jindong Gu, Rajat Koner, Aljoša Ošep, Laura Leal-Taixé, Thomas Seidl
- Abstract summary: We introduce Spatio-temporal Video Action Grounding (SVAG), a novel task that requires models to simultaneously detect, track, and temporally localize all referent objects in videos. To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs. Empirical results show that existing models perform poorly on SVAG, particularly in dense or complex scenes.
- Score: 48.64661382961745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding fine-grained actions and accurately localizing their corresponding actors in space and time are fundamental capabilities for advancing next-generation AI systems, including embodied agents, autonomous platforms, and human-AI interaction frameworks. Despite recent progress in video understanding, existing methods predominantly address either coarse-grained action recognition or generic object tracking, thereby overlooking the challenge of jointly detecting and tracking multiple objects according to their actions while grounding them temporally. To address this gap, we introduce Spatio-temporal Video Action Grounding (SVAG), a novel task that requires models to simultaneously detect, track, and temporally localize all referent objects in videos based on natural language descriptions of their actions. To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs, covering a diverse range of objects, actions, and real-world scenes. We further propose SVAGFormer, a baseline framework that adapts state-of-the-art vision-language models for joint spatial and temporal grounding, and introduce SVAGEval, a standardized evaluation toolkit for fair and reproducible benchmarking. Empirical results show that existing models perform poorly on SVAG, particularly in dense or complex scenes, underscoring the need for more advanced reasoning over fine-grained object-action interactions in long videos.
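The abstract defines SVAG as joint detection, tracking, and temporal localization of every object referred to by an action query, scored with the SVAGEval toolkit. The sketch below is a minimal, hypothetical illustration of what a single SVAG annotation or prediction record could contain and how spatial and temporal agreement might be scored jointly; it is not the official SVAGEval API, and all field names, function names, and the toy metric are assumptions made for illustration only.

```python
# Minimal sketch (NOT the official SVAGEval toolkit): one record couples a
# language query with a per-frame box track and a temporal span, and a toy
# metric rewards predictions only when both the track and the span are right.
from dataclasses import dataclass, field
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class SVAGRecord:
    video_id: str
    query: str                                            # action description, e.g. "the dog jumping over the fence"
    track: Dict[int, Box] = field(default_factory=dict)   # frame index -> bounding box of the referent object
    temporal_span: Tuple[int, int] = (0, 0)                # (start_frame, end_frame) during which the action occurs


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union between two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def temporal_iou(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    """IoU between two inclusive frame intervals (temporal localization quality)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union if union > 0 else 0.0


def joint_score(pred: SVAGRecord, gt: SVAGRecord) -> float:
    """Toy joint metric: mean box IoU over ground-truth frames times temporal IoU.

    The real SVAGEval metrics are not specified in the abstract; this only shows
    that a prediction is judged on both the spatial track and the temporal span.
    """
    spatial = [box_iou(pred.track.get(f, (0, 0, 0, 0)), box) for f, box in gt.track.items()]
    mean_spatial = sum(spatial) / len(spatial) if spatial else 0.0
    return mean_spatial * temporal_iou(pred.temporal_span, gt.temporal_span)
```

Under this kind of joint scoring, a model that tracks the right object but misses the action interval (or vice versa) gets little credit, which is consistent with the paper's observation that dense, multi-instance scenes remain difficult for existing models.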
Related papers
- Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video? [1.1288535170985818]
We introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively.
arXiv Detail & Related papers (2025-12-02T14:57:17Z) - VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z) - Temporally Consistent Dynamic Scene Graphs: An End-to-End Approach for Action Tracklet Generation [1.6584112749108326]
TCDSG, Temporally Consistent Dynamic Scene Graphs, is an end-to-end framework that detects, tracks, and links subject-object relationships across time. Our work sets a new standard in multi-frame video analysis, opening new avenues for high-impact applications in surveillance, autonomous navigation, and beyond.
arXiv Detail & Related papers (2024-12-03T20:19:20Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - Spatial-Temporal Multi-level Association for Video Object Segmentation [89.32226483171047]
This paper proposes spatial-temporal multi-level association, which jointly associates reference frame, test frame, and object features.
Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features.
arXiv Detail & Related papers (2024-04-09T12:44:34Z) - REACT: Recognize Every Action Everywhere All At Once [8.10024991952397]
Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports analysis, surveillance, and social scene understanding.
We present REACT, an architecture inspired by the transformer encoder-decoder model.
Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities.
arXiv Detail & Related papers (2023-11-27T20:48:54Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z) - Tracking Objects and Activities with Attention for Temporal Sentence Grounding [51.416914256782505]
Temporal sentence grounding (TSG) aims to localize the temporal segment that is semantically aligned with a natural language query in an untrimmed video.
We propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal targets and a search space, and (B) a Temporal Sentence Tracker to track the multi-modal targets' behavior and predict the query-related segment.
arXiv Detail & Related papers (2023-02-21T16:42:52Z) - SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC representation significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)