STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding
- URL: http://arxiv.org/abs/2502.20678v2
- Date: Sat, 05 Apr 2025 08:57:56 GMT
- Title: STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding
- Authors: Aaryan Garg, Akash Kumar, Yogesh S Rawat
- Abstract summary: We study Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a challenging task of localizing subjects spatio-temporally in videos. Inspired by recent advances in vision-language foundation models, we investigate their utility for WSTVG, leveraging their zero-shot grounding capabilities. To bridge this gap, we propose STPro, a novel progressive learning framework with two key modules.
- Score: 13.352635332422768
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work we study Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a challenging task of localizing subjects spatio-temporally in videos using only textual queries and no bounding box supervision. Inspired by recent advances in vision-language foundation models, we investigate their utility for WSTVG, leveraging their zero-shot grounding capabilities. However, we find that a simple adaptation lacks essential spatio-temporal grounding abilities. To bridge this gap, we introduce Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions. Despite its promise, TRG struggles with compositional action understanding and dense scene scenarios. To address these limitations, we propose STPro, a novel progressive learning framework with two key modules: (1) Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally builds compositional action understanding, and (2) Congestion-Guided Spatial Curriculum Learning (CG-SCL), which adapts the model to complex scenes by spatially increasing task difficulty. STPro achieves state-of-the-art results on three benchmark datasets, with improvements of 1.0% on VidSTG-Declarative and 3.0% on HCSTVG-v1.
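The abstract describes an easy-to-hard training schedule along two axes: compositional action complexity (SA-TCL) and scene congestion (CG-SCL). Below is a minimal, hypothetical sketch of such a two-axis curriculum, not the authors' implementation; all names (TrainingSample, sub_action_count, scene_congestion, build_curriculum) and the specific difficulty proxies are assumptions for illustration only.

```python
# Hypothetical sketch of a two-axis easy-to-hard curriculum, loosely mirroring
# the SA-TCL / CG-SCL idea from the abstract. Not the authors' code.
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingSample:
    video_id: str
    query: str
    sub_action_count: int    # assumed proxy for compositional action complexity (SA-TCL axis)
    scene_congestion: float  # assumed proxy for scene density, e.g. detections per frame (CG-SCL axis)


def build_curriculum(samples: List[TrainingSample], num_stages: int = 3) -> List[List[TrainingSample]]:
    """Split training data into progressively harder stages.

    Early stages hold queries with few sub-actions and sparse scenes; later
    stages add longer compositional actions and more congested scenes.
    """
    ranked = sorted(samples, key=lambda s: (s.sub_action_count, s.scene_congestion))
    stage_size = max(1, len(ranked) // num_stages)
    stages = [ranked[i * stage_size:(i + 1) * stage_size] for i in range(num_stages - 1)]
    stages.append(ranked[(num_stages - 1) * stage_size:])  # last stage takes the remainder
    return stages


if __name__ == "__main__":
    data = [
        TrainingSample("v1", "person walks", 1, 0.2),
        TrainingSample("v2", "person picks up a bag then runs", 2, 0.5),
        TrainingSample("v3", "person in a crowd waves then sits down", 2, 0.9),
    ]
    for stage_idx, stage in enumerate(build_curriculum(data, num_stages=3)):
        print(stage_idx, [s.video_id for s in stage])
```

In an actual training loop, each stage would be used for a number of epochs before the next, harder stage is mixed in; how the paper schedules stages and measures difficulty is not specified in the abstract.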
Related papers
- Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization [129.43937834515688]
We propose a new COllaborative Temporal consistEncy Learning (COTEL) framework to strengthen the video-language alignment.
Specifically, we first design a frame- and a segment-level Temporal Consistency Learning (TCL) module that models semantic alignment across frame saliencies and sentence-moment pairs.
arXiv Detail & Related papers (2025-03-22T05:04:12Z) - Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding [24.650102499933514]
We focus on Weakly Supervised Spatio-Temporal Video Grounding (WSTVG). We first explore the potential of state-of-the-art object detection models for WSTVG. Despite their robust zero-shot capabilities, our adaptation reveals significant limitations. We propose CoSPaL, a novel approach which is designed to overcome these limitations.
arXiv Detail & Related papers (2025-01-28T16:25:10Z) - Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding [108.79026216923984]
Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query.
This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task.
arXiv Detail & Related papers (2023-12-31T13:53:37Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Self-supervised Learning for Semi-supervised Temporal Language Grounding [84.11582376377471]
Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
Previous works either tackle this task in a fully-supervised setting that requires a large amount of manual annotations or in a weakly supervised setting that cannot achieve satisfactory performance.
To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework.
arXiv Detail & Related papers (2021-09-23T16:29:16Z) - A Survey on Temporal Sentence Grounding in Videos [69.13365006222251]
Temporal sentence grounding in videos (TSGV) aims to localize one target segment from an untrimmed video with respect to a given sentence query.
To the best of our knowledge, this is the first systematic survey on temporal sentence grounding.
arXiv Detail & Related papers (2021-09-16T15:01:46Z) - Visual Relation Grounding in Videos [86.06874453626347]
We explore a novel task named visual Relation Grounding in Videos (RGV).
This task aims at providing supportive visual facts for other video-language tasks (e.g., video grounding and video question answering).
We tackle challenges by collaboratively optimizing two sequences of regions over a constructed hierarchical spatio-temporal region graph.
Experimental results demonstrate our model can not only outperform baseline approaches significantly, but also produce visually meaningful facts.
arXiv Detail & Related papers (2020-07-17T08:20:39Z) - Spatio-Temporal Ranked-Attention Networks for Video Captioning [34.05025890230047]
We propose a model that applies spatial and temporal attention to videos in two different orders.
We provide experiments on two benchmark datasets: MSVD and MSR-VTT.
Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.
arXiv Detail & Related papers (2020-01-17T01:00:45Z)