Human-centric Spatio-Temporal Video Grounding With Visual Transformers
- URL: http://arxiv.org/abs/2011.05049v2
- Date: Wed, 2 Jun 2021 06:51:34 GMT
- Title: Human-centric Spatio-Temporal Video Grounding With Visual Transformers
- Authors: Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu
Jiang, Qian Yu, Dong Xu
- Abstract summary: We introduce a novel task - Human-centric Spatio-Temporal Video Grounding (HC-STVG).
HC-STVG aims to localize a spatio-temporal tube of the target person from an untrimmed video based on a given description.
We tackle this task by proposing an effective baseline method named Spatio-Temporal Grounding with Visual Transformers (STGVT).
- Score: 70.50326310780407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we introduce a novel task - Human-centric Spatio-Temporal Video
Grounding (HC-STVG). Unlike the existing referring expression tasks in images
or videos, by focusing on humans, HC-STVG aims to localize a spatiotemporal
tube of the target person from an untrimmed video based on a given textual
description. This task is useful, especially for healthcare and
security-related applications, where the surveillance videos can be extremely
long but only a specific person during a specific period of time is of concern.
HC-STVG is a video grounding task that requires both spatial (where) and
temporal (when) localization. Unfortunately, the existing grounding methods
cannot handle this task well. We tackle this task by proposing an effective
baseline method named Spatio-Temporal Grounding with Visual Transformers
(STGVT), which utilizes Visual Transformers to extract cross-modal
representations for video-sentence matching and temporal localization. To
facilitate this task, we also contribute an HC-STVG dataset consisting of 5,660
video-sentence pairs on complex multi-person scenes. Specifically, each video
lasts for 20 seconds, pairing with a natural query sentence with an average of
17.25 words. Extensive experiments are conducted on this dataset, demonstrating
that the newly proposed method outperforms the existing baseline methods.
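The abstract describes STGVT only at a high level: Visual Transformers extract cross-modal representations that are used for video-sentence matching and temporal localization. As a rough, purely illustrative sketch (not the authors' implementation), the PyTorch module below fuses per-frame tube features with sentence token embeddings in a joint transformer encoder and predicts a matching score plus per-frame start/end logits; the module name, feature dimensions, and prediction heads are all assumptions made for illustration.

```python
# Hypothetical sketch of a cross-modal transformer for video-sentence matching
# and temporal localization; NOT the STGVT architecture from the paper.
import torch
import torch.nn as nn


class CrossModalGroundingSketch(nn.Module):
    def __init__(self, visual_dim=2048, text_dim=768, d_model=256,
                 num_layers=4, num_heads=8):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, d_model)   # per-frame tube features
        self.text_proj = nn.Linear(text_dim, d_model)       # sentence token embeddings
        self.type_embed = nn.Embedding(2, d_model)          # 0 = visual, 1 = text
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.match_head = nn.Linear(d_model, 1)              # video-sentence matching score
        self.temporal_head = nn.Linear(d_model, 2)           # per-frame start/end logits

    def forward(self, tube_feats, token_feats):
        # tube_feats: (B, T, visual_dim), token_feats: (B, L, text_dim)
        v = self.visual_proj(tube_feats) + self.type_embed.weight[0]
        t = self.text_proj(token_feats) + self.type_embed.weight[1]
        x = self.encoder(torch.cat([v, t], dim=1))           # joint cross-modal sequence
        frame_states = x[:, : tube_feats.size(1)]            # states at visual positions
        match_score = self.match_head(x.mean(dim=1)).squeeze(-1)
        start_end_logits = self.temporal_head(frame_states)  # (B, T, 2)
        return match_score, start_end_logits


if __name__ == "__main__":
    model = CrossModalGroundingSketch()
    score, boundaries = model(torch.randn(2, 20, 2048), torch.randn(2, 17, 768))
    print(score.shape, boundaries.shape)  # torch.Size([2]) torch.Size([2, 20, 2])
```

In such a design the matching head would rank candidate tubes against the sentence, while the per-frame logits would trim the tube temporally; how STGVT actually couples these two steps is not specified in the abstract.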
Related papers
- ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models [53.9661582975843]
Video Temporal Grounding aims to ground specific segments within an untrimmed video corresponding to a given natural language query.
Existing VTG methods largely depend on supervised learning and extensive annotated data, which is labor-intensive and prone to human biases.
We present ChatVTG, a novel approach that utilizes Video Dialogue Large Language Models (LLMs) for zero-shot video temporal grounding.
arXiv Detail & Related papers (2024-10-01T08:27:56Z) - Described Spatial-Temporal Video Detection [33.69632963941608]
Spatio-temporal video grounding (STVG) is formulated to detect only one pre-existing object in each frame.
In this work, we advance STVG to a more practical setting called described spatial-temporal video detection (DSTVD).
The accompanying DVD-ST benchmark supports grounding from none to many objects in the video in response to queries.
arXiv Detail & Related papers (2024-07-08T04:54:39Z) - What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions [55.574102714832456]
Spatio-temporal grounding describes the task of localizing events in space and time.
Models for this task are usually trained with human-annotated sentences and bounding box supervision.
We combine local representation learning, which focuses on fine-grained spatial information, with a global representation that captures higher-level features.
arXiv Detail & Related papers (2023-03-29T19:38:23Z) - TubeDETR: Spatio-Temporal Video Grounding with Transformers [89.71617065426146]
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection.
arXiv Detail & Related papers (2022-03-30T16:31:49Z) - Visual Relation Grounding in Videos [86.06874453626347]
We explore a novel task named visual Relation Grounding in Videos (RGV).
This task aims at providing supportive visual facts for other video-language tasks (e.g., video grounding and video question answering).
We tackle the challenges by collaboratively optimizing two sequences of regions over a constructed hierarchical spatio-temporal region graph.
Experimental results demonstrate that our model not only outperforms baseline approaches significantly, but also produces visually meaningful facts.
arXiv Detail & Related papers (2020-07-17T08:20:39Z) - Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form
Sentences [107.0776836117313]
Given an untrimmed video and a declarative/interrogative sentence, STVG aims to localize the spatio-temporal tube of the queried object.
Existing methods cannot tackle the STVG task due to the ineffective tube pre-generation and the lack of novel object relationship modeling.
We present a Spatio-Temporal Graph Reasoning Network (STGRN) for this task.
arXiv Detail & Related papers (2020-01-19T19:53:22Z)