Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form
Sentences
- URL: http://arxiv.org/abs/2001.06891v3
- Date: Tue, 24 Mar 2020 21:34:44 GMT
- Title: Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form
Sentences
- Authors: Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, Lianli Gao
- Abstract summary: Given an untrimmed video and a declarative/interrogative sentence, STVG aims to localize the spatio-temporal tube of the queried object.
Existing methods cannot tackle the STVG task due to ineffective tube pre-generation and the lack of object relationship modeling.
We present a Spatio-Temporal Graph Reasoning Network (STGRN) for this task.
- Score: 107.0776836117313
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider a novel task, Spatio-Temporal Video Grounding for
Multi-Form Sentences (STVG). Given an untrimmed video and a
declarative/interrogative sentence depicting an object, STVG aims to localize
the spatio-temporal tube of the queried object. STVG has two challenging
settings: (1) We need to localize spatio-temporal object tubes from untrimmed
videos, where the object may only exist in a very small segment of the video;
(2) We deal with multi-form sentences, including the declarative sentences with
explicit objects and interrogative sentences with unknown objects. Existing
methods cannot tackle the STVG task due to the ineffective tube pre-generation
and the lack of object relationship modeling. Thus, we then propose a novel
Spatio-Temporal Graph Reasoning Network (STGRN) for this task. First, we build
a spatio-temporal region graph to capture the region relationships with
temporal object dynamics, which involves the implicit and explicit spatial
subgraphs in each frame and the temporal dynamic subgraph across frames. We
then incorporate textual clues into the graph and develop the multi-step
cross-modal graph reasoning. Next, we introduce a spatio-temporal localizer
with a dynamic selection method to directly retrieve the spatio-temporal tubes
without tube pre-generation. Moreover, we contribute a large-scale video
grounding dataset VidSTG based on video relation dataset VidOR. The extensive
experiments demonstrate the effectiveness of our method.
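The abstract describes the STGRN pipeline in prose: multi-step cross-modal graph reasoning over region nodes, followed by a localizer that reads the spatio-temporal tube out directly instead of scoring pre-generated tubes. As a rough, hypothetical illustration only (not the authors' released code), a minimal PyTorch sketch of that two-stage idea might look as follows; the module names, the GRU-style node update, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalGraphStep(nn.Module):
    """One assumed step of cross-modal graph reasoning: region features are
    gated by the sentence embedding, then exchange messages along graph edges."""
    def __init__(self, dim):
        super().__init__()
        self.text_gate = nn.Linear(2 * dim, dim)  # fuse textual clue into each region node
        self.message = nn.Linear(dim, dim)        # message transform along edges
        self.update = nn.GRUCell(dim, dim)        # node update after aggregation

    def forward(self, regions, adjacency, sentence):
        # regions: (N, dim) region node features; adjacency: (N, N) row-normalized
        # edge weights (stand-in for spatial + temporal links); sentence: (dim,)
        text = sentence.unsqueeze(0).expand_as(regions)
        gated = torch.sigmoid(self.text_gate(torch.cat([regions, text], dim=-1))) * regions
        agg = adjacency @ self.message(gated)     # aggregate neighbor messages
        return self.update(agg, regions)          # GRU-style node update

class TubeLocalizer(nn.Module):
    """Predicts per-frame start/end logits and a box per frame, so a tube can be
    read out directly without tube pre-generation (proposal-free readout)."""
    def __init__(self, dim):
        super().__init__()
        self.boundary = nn.Linear(dim, 2)  # start / end logits per frame
        self.box = nn.Linear(dim, 4)       # normalized (cx, cy, w, h) per frame

    def forward(self, frame_feats):
        # frame_feats: (T, dim), one pooled feature per frame after reasoning
        return self.boundary(frame_feats), self.box(frame_feats).sigmoid()

if __name__ == "__main__":
    N, T, dim = 12, 6, 256
    reasoner, localizer = CrossModalGraphStep(dim), TubeLocalizer(dim)
    regions = torch.randn(N, dim)
    adjacency = torch.softmax(torch.randn(N, N), dim=-1)    # stand-in graph edges
    sentence = torch.randn(dim)
    for _ in range(3):                                      # multi-step reasoning
        regions = reasoner(regions, adjacency, sentence)
    frame_feats = regions.view(T, N // T, dim).mean(dim=1)  # pool regions per frame
    boundary_logits, boxes = localizer(frame_feats)
    print(boundary_logits.shape, boxes.shape)               # (6, 2), (6, 4)
```

In this sketch, a temporal segment would be decoded from the start/end logits and the per-frame boxes inside that segment would form the tube; how STGRN actually scores and selects boundaries is specified in the paper, not here.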
Related papers
- Described Spatial-Temporal Video Detection [33.69632963941608]
Spatial-temporal video grounding (STVG) is formulated to only detect one pre-existing object in each frame.
In this work, we advance STVG to a more practical setting called described spatial-temporal video detection (DSTVD).
DVD-ST supports grounding from none to many objects onto the video in response to queries.
arXiv Detail & Related papers (2024-07-08T04:54:39Z) - Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding [108.79026216923984]
Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query.
This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task.
arXiv Detail & Related papers (2023-12-31T13:53:37Z) - Temporal Sentence Grounding in Streaming Videos [60.67022943824329]
This paper aims to tackle a novel task - Temporal Sentence Grounding in Streaming Videos (TSGSV)
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
arXiv Detail & Related papers (2023-08-14T12:30:58Z) - TubeDETR: Spatio-Temporal Video Grounding with Transformers [89.71617065426146]
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection.
arXiv Detail & Related papers (2022-03-30T16:31:49Z) - Spatial-Temporal Transformer for Dynamic Scene Graph Generation [34.190733855032065]
We propose a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input (a rough sketch of this layout appears after this list).
Our method is validated on the benchmark dataset Action Genome (AG).
arXiv Detail & Related papers (2021-07-26T16:30:30Z) - Human-centric Spatio-Temporal Video Grounding With Visual Transformers [70.50326310780407]
We introduce a novel task - Human-centric Spatio-Temporal Video Grounding (HC-STVG).
HC-STVG aims to localize a spatio-temporal tube of the target person from an untrimmed video based on a given description.
We tackle this task by proposing an effective baseline method named Spatio-Temporal Grounding with Visual Transformers (STGVT).
arXiv Detail & Related papers (2020-11-10T11:23:38Z) - Visual Relation Grounding in Videos [86.06874453626347]
We explore a novel task named visual Relation Grounding in Videos (RGV).
This task aims at providing supportive visual facts for other video-language tasks (e.g., video grounding and video question answering).
We tackle the challenges by collaboratively optimizing two sequences of regions over a constructed hierarchical spatio-temporal region graph.
Experimental results demonstrate that our model not only outperforms baseline approaches significantly, but also produces visually meaningful facts.
arXiv Detail & Related papers (2020-07-17T08:20:39Z) - Spatio-Temporal Ranked-Attention Networks for Video Captioning [34.05025890230047]
We propose a model that combines spatial and temporal attention to videos in two different orders.
We provide experiments on two benchmark datasets: MSVD and MSR-VTT.
Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.
arXiv Detail & Related papers (2020-01-17T01:00:45Z)
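For the spatial encoder / temporal decoder design mentioned in the Spatial-Temporal Transformer entry above, a minimal hedged sketch in PyTorch is shown below; the layer counts, the pooling of per-frame object features into frame tokens, and the learned temporal queries are assumptions for illustration, not that paper's actual implementation.

```python
import torch
import torch.nn as nn

class SpatialTemporalSketch(nn.Module):
    """Hypothetical sketch: a per-frame spatial encoder over object features,
    followed by a temporal decoder that attends across the encoded frames."""
    def __init__(self, dim=256, heads=8, max_frames=8):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.temporal_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        # learned temporal queries, one per frame (assumes T <= max_frames)
        self.frame_queries = nn.Parameter(torch.randn(1, max_frames, dim))

    def forward(self, object_feats):
        # object_feats: (T, N, dim) -- N detected-object features for each of T frames
        T = object_feats.shape[0]
        per_frame = self.spatial_encoder(object_feats)      # relationships within each frame
        frame_tokens = per_frame.mean(dim=1).unsqueeze(0)    # (1, T, dim) pooled frame summaries
        queries = self.frame_queries[:, :T]                  # (1, T, dim) temporal queries
        return self.temporal_decoder(queries, frame_tokens)  # (1, T, dim) temporal context

if __name__ == "__main__":
    model = SpatialTemporalSketch(dim=256, heads=8, max_frames=8)
    feats = torch.randn(5, 10, 256)   # 5 frames, 10 object features per frame
    print(model(feats).shape)         # torch.Size([1, 5, 256])
```

A downstream head (e.g., relationship classification for dynamic scene graphs) would be attached to the decoder output; that part is omitted here.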