Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video
Grounding
- URL: http://arxiv.org/abs/2209.13306v1
- Date: Tue, 27 Sep 2022 11:13:04 GMT
- Title: Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video
Grounding
- Authors: Yang Jin, Yongzhi Li, Zehuan Yuan, Yadong Mu
- Abstract summary: We present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT)
To generate the above template under sufficient video-textual perception, an encoder-decoder architecture is proposed for effective global context modeling.
Our method outperforms previous state-of-the-art methods by clear margins on two challenging video benchmarks.
- Score: 35.73830796500975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatio-Temporal video grounding (STVG) focuses on retrieving the
spatio-temporal tube of a specific object depicted by a free-form textual
expression. Existing approaches mainly treat this complicated task as a
parallel frame-grounding problem and thus suffer from two types of
inconsistency drawbacks: feature alignment inconsistency and prediction
inconsistency. In this paper, we present an end-to-end one-stage framework,
termed Spatio-Temporal Consistency-Aware Transformer (STCAT), to alleviate
these issues. Specifically, we introduce a novel multi-modal template as the
global objective for this task, which explicitly constrains the grounding
region and associates the predictions across all video frames.
Moreover, to generate the above template under sufficient video-textual
perception, an encoder-decoder architecture is proposed for effective global
context modeling. Thanks to these critical designs, STCAT enjoys more
consistent cross-modal feature alignment and tube prediction without reliance
on any pre-trained object detectors. Extensive experiments show that our method
outperforms previous state-of-the-art methods by clear margins on two
challenging video benchmarks (VidSTG and HC-STVG), illustrating the
superiority of the proposed framework in understanding the association
between vision and natural language. Code is publicly available at
\url{https://github.com/jy0205/STCAT}.
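The repository above contains the authors' code; purely as orientation, below is a minimal PyTorch sketch of a one-stage, template-style grounding head in the spirit the abstract describes. The class name, module choices, and tensor shapes are assumptions for illustration, not STCAT's actual implementation.
```python
# Minimal, illustrative sketch (not the authors' code) of a one-stage,
# template-driven spatio-temporal grounding head. Shapes, names, and the
# use of stock nn.Transformer modules are assumptions for exposition only.
import torch
import torch.nn as nn


class OneStageGroundingSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)   # joint video-text context
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)   # refines the global template
        self.template = nn.Parameter(torch.randn(1, 1, d_model))    # shared "template" query
        self.box_head = nn.Linear(d_model, 4)                       # per-frame (cx, cy, w, h)
        self.span_head = nn.Linear(d_model, 2)                      # per-frame start/end logits

    def forward(self, video_tokens, text_tokens):
        # video_tokens: (B, T, d_model) frame features; text_tokens: (B, L, d_model)
        B, T, _ = video_tokens.shape
        memory = self.encoder(torch.cat([video_tokens, text_tokens], dim=1))
        # The same template is expanded to every frame, so all per-frame
        # predictions are tied to one global query instead of being grounded
        # independently frame by frame.
        queries = self.template.expand(B, T, -1)
        decoded = self.decoder(queries, memory)                     # (B, T, d_model)
        boxes = self.box_head(decoded).sigmoid()                    # (B, T, 4)
        span_logits = self.span_head(decoded)                       # (B, T, 2)
        return boxes, span_logits


if __name__ == "__main__":
    model = OneStageGroundingSketch()
    boxes, span = model(torch.randn(2, 16, 256), torch.randn(2, 10, 256))
    print(boxes.shape, span.shape)  # torch.Size([2, 16, 4]) torch.Size([2, 16, 2])
```
Sharing one learned template across frames is what keeps the per-frame box and span predictions associated; grounding each frame with independent queries is exactly the prediction inconsistency the abstract argues against.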
Related papers
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- OED: Towards One-stage End-to-End Dynamic Scene Graph Generation [18.374354844446962]
Dynamic Scene Graph Generation (DSGG) focuses on identifying visual relationships within the spatial-temporal domain of videos.
We propose a one-stage end-to-end framework, termed OED, which streamlines the DSGG pipeline.
This framework reformulates the task as a set prediction problem and leverages pair-wise features to represent each subject-object pair within the scene graph.
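As a rough illustration of the "pair-wise features" idea (an assumption about how such a head could look, not OED's implementation), the sketch below shows set-prediction heads that read one decoded feature per subject-object pair and emit boxes, class logits, and a predicate logit in a single pass; the transformer decoder producing those pair features is assumed and omitted.
```python
# Illustrative sketch only (not OED's code): prediction heads over one
# decoded feature vector per subject-object pair. `decoded_pairs` is assumed
# to come from a transformer decoder that is not shown here.
import torch
import torch.nn as nn


class PairPredictionHead(nn.Module):
    def __init__(self, d_model=256, n_classes=50, n_predicates=30):
        super().__init__()
        self.subj_box = nn.Linear(d_model, 4)          # subject box (cx, cy, w, h)
        self.obj_box = nn.Linear(d_model, 4)           # object box
        self.subj_cls = nn.Linear(d_model, n_classes)  # subject category logits
        self.obj_cls = nn.Linear(d_model, n_classes)   # object category logits
        self.predicate = nn.Linear(d_model, n_predicates)  # relationship logits

    def forward(self, decoded_pairs):
        # decoded_pairs: (B, n_pairs, d_model), one feature per candidate pair
        return {
            "subj_boxes": self.subj_box(decoded_pairs).sigmoid(),
            "obj_boxes": self.obj_box(decoded_pairs).sigmoid(),
            "subj_logits": self.subj_cls(decoded_pairs),
            "obj_logits": self.obj_cls(decoded_pairs),
            "predicate_logits": self.predicate(decoded_pairs),
        }


if __name__ == "__main__":
    head = PairPredictionHead()
    out = head(torch.randn(2, 100, 256))
    print(out["predicate_logits"].shape)  # torch.Size([2, 100, 30])
```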
arXiv Detail & Related papers (2024-05-27T08:18:41Z)
- Temporal Sentence Grounding in Streaming Videos [60.67022943824329]
This paper aims to tackle a novel task: Temporal Sentence Grounding in Streaming Videos (TSGSV).
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
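A minimal sketch of what a "language-guided feature compressor" could look like, assuming cosine similarity between frame and sentence embeddings as the relevance score; this is a guess at the mechanism, not the paper's method.
```python
# Rough sketch (an assumption, not the paper's code) of language-guided frame
# compression for a streaming setting: buffered frames are scored against the
# sentence embedding and only the top-k most relevant ones are kept.
import torch
import torch.nn.functional as F


def compress_frames(frame_feats, sent_feat, keep=32):
    # frame_feats: (T, D) features of buffered stream frames; sent_feat: (D,)
    scores = F.cosine_similarity(frame_feats, sent_feat.unsqueeze(0), dim=-1)  # (T,)
    keep = min(keep, frame_feats.size(0))
    idx = scores.topk(keep).indices.sort().values   # restore temporal order
    return frame_feats[idx], idx


if __name__ == "__main__":
    kept, idx = compress_frames(torch.randn(200, 256), torch.randn(256), keep=32)
    print(kept.shape, idx[:5])
```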
arXiv Detail & Related papers (2023-08-14T12:30:58Z)
- SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
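The sketch below is a loose reading of "unified temporal modeling and cross-modal alignment": per-frame object features are fused with the text and then pooled into a video-level representation via cross-attention. Module names and the single-query pooling are assumptions, not SOC's implementation.
```python
# Hand-wavy sketch (not the SOC code) of aggregating per-frame object features
# into a video-level, text-conditioned representation with cross-attention.
import torch
import torch.nn as nn


class VideoLevelAggregator(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.text_fuse = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal_pool = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_query = nn.Parameter(torch.randn(1, 1, d_model))  # learned pooling query

    def forward(self, obj_feats, text_feats):
        # obj_feats: (B, T, N, D) per-frame object features; text_feats: (B, L, D)
        B, T, N, D = obj_feats.shape
        flat = obj_feats.reshape(B, T * N, D)
        fused, _ = self.text_fuse(flat, text_feats, text_feats)   # inject language guidance
        q = self.video_query.expand(B, -1, -1)
        video_obj, _ = self.temporal_pool(q, fused, fused)        # (B, 1, D) video-level cluster
        return video_obj.squeeze(1)


if __name__ == "__main__":
    agg = VideoLevelAggregator()
    out = agg(torch.randn(2, 8, 5, 256), torch.randn(2, 12, 256))
    print(out.shape)  # torch.Size([2, 256])
```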
arXiv Detail & Related papers (2023-05-26T15:13:44Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
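As a toy example of the consistency idea, assuming the grounder outputs a normalized (start, end) span: if the video is temporally cropped, an equivariant model's span should move by the same affine transform. The crop choice, the L1 penalty, and `predict_fn` are all hypothetical, not the paper's loss.
```python
# Toy sketch (my reading of the idea, not the paper's objective) of a
# transform-equivariant consistency loss for temporal sentence grounding.
import torch.nn.functional as F


def consistency_loss(predict_fn, video_feats, text_feats, crop=(0.1, 0.9)):
    # predict_fn(video_feats, text_feats) -> normalized (start, end) span in [0, 1]
    T = video_feats.size(0)
    a, b = int(crop[0] * T), int(crop[1] * T)
    span_full = predict_fn(video_feats, text_feats)        # span on the full video
    span_crop = predict_fn(video_feats[a:b], text_feats)   # span on the cropped clip
    # The crop is an affine map of the timeline, so the full-video span,
    # re-expressed in the clip's coordinates, should match the clip prediction.
    expected = (span_full * T - a) / (b - a)
    return F.l1_loss(span_crop, expected.clamp(0.0, 1.0).detach())
```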
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- TubeDETR: Spatio-Temporal Video Grounding with Transformers [89.71617065426146]
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection.
arXiv Detail & Related papers (2022-03-30T16:31:49Z)
- Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance Video [128.41392860714635]
We introduce Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video.
WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses an abnormal event.
We propose a dual-branch network which takes as input proposals at multiple granularities in both the spatial and temporal domains.
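A schematic sketch of the rough shape a dual-branch, multi-granularity scorer under weak video-level supervision might take; branch design, feature shapes, and the MIL-style max pooling are assumptions, not the paper's architecture.
```python
# Schematic sketch (assumptions throughout, not the paper's network): one
# branch scores whole tube proposals (coarse granularity), the other scores
# the individual boxes inside them (fine granularity), and the weak
# video-level prediction is the maximum proposal score.
import torch
import torch.nn as nn


class DualBranchScorer(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.tube_branch = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))
        self.box_branch = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, tube_feats, box_feats):
        # tube_feats: (P, D) one feature per tube proposal
        # box_feats:  (P, T, D) per-box features inside each proposal
        tube_scores = self.tube_branch(tube_feats).squeeze(-1)            # (P,)
        box_scores = self.box_branch(box_feats).squeeze(-1).mean(dim=1)   # (P,)
        anomaly_scores = (tube_scores + box_scores) / 2                   # per-proposal score
        video_score = anomaly_scores.max()   # weak, video-level prediction (MIL-style)
        return anomaly_scores, video_score


if __name__ == "__main__":
    scorer = DualBranchScorer()
    scores, video_score = scorer(torch.randn(8, 256), torch.randn(8, 16, 256))
    print(scores.shape, float(video_score))
```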
arXiv Detail & Related papers (2021-08-09T06:11:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.