Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance
Video
- URL: http://arxiv.org/abs/2108.03825v1
- Date: Mon, 9 Aug 2021 06:11:14 GMT
- Title: Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance
Video
- Authors: Jie Wu, Wei Zhang, Guanbin Li, Wenhao Wu, Xiao Tan, Yingying Li, Errui
Ding, Liang Lin
- Abstract summary: We introduce Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video.
WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses the abnormal event.
We propose a dual-branch network which takes as input proposals with multiple granularities in both the spatial and temporal domains.
- Score: 128.41392860714635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce a novel task, referred to as Weakly-Supervised
Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video. Specifically,
given an untrimmed video, WSSTAD aims to localize a spatio-temporal tube (i.e.,
a sequence of bounding boxes at consecutive times) that encloses the abnormal
event, with only coarse video-level annotations as supervision during training.
To address this challenging task, we propose a dual-branch network which takes
as input proposals with multiple granularities in both the spatial and temporal
domains. Each branch employs a relationship reasoning module to capture the
correlation between tubes/videolets, which can provide rich contextual
information and complex entity relationships for the concept learning of
abnormal behaviors. A Mutually-guided Progressive Refinement framework is set up
to employ dual-path mutual guidance in a recurrent manner, iteratively sharing
auxiliary supervision information across branches. It impels the learned
concepts of each branch to serve as a guide for its counterpart, which
progressively refines the corresponding branch and the whole framework.
Furthermore, we contribute two datasets, i.e., ST-UCF-Crime and STRA,
consisting of videos containing spatio-temporal abnormal annotations to serve
as the benchmarks for WSSTAD. We conduct extensive qualitative and quantitative
evaluations to demonstrate the effectiveness of the proposed approach and
analyze the key factors that contribute most to handling this task.
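The abstract describes the dual-branch design and mutual guidance only at a high level. As a rough, non-authoritative sketch of the idea, the following PyTorch-style code shows how two proposal branches might each apply attention-based relationship reasoning and exchange detached scores as soft guidance; every module name, dimension, and loss choice here is an assumption for illustration, not the authors' implementation.

```python
# Minimal sketch of a dual-branch scorer with mutual guidance, loosely
# following the abstract. All names, dimensions, and the attention stand-in
# for "relationship reasoning" are assumptions, not the WSSTAD implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationReasoning(nn.Module):
    """Self-attention over proposal features, standing in for the paper's
    relationship reasoning module (correlation between tubes/videolets)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):              # x: (batch, num_proposals, dim)
        ctx, _ = self.attn(x, x, x)    # each proposal attends to all others
        return self.norm(x + ctx)      # residual, context-enriched features

class Branch(nn.Module):
    """One branch: reasons over its proposals, then scores each one."""
    def __init__(self, dim):
        super().__init__()
        self.reason = RelationReasoning(dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):          # feats: (batch, num_proposals, dim)
        return self.score(self.reason(feats)).squeeze(-1)

def mutual_guidance_step(tube_branch, vid_branch, tube_feats, vid_feats):
    """One refinement iteration: each branch's detached video-level score acts
    as a soft pseudo-label for its counterpart (the recurrent dual-path
    guidance the abstract describes, reduced here to a single BCE term)."""
    tube_logits = tube_branch(tube_feats).max(dim=1).values   # video-level
    vid_logits = vid_branch(vid_feats).max(dim=1).values
    loss = (F.binary_cross_entropy_with_logits(
                vid_logits, torch.sigmoid(tube_logits).detach())
            + F.binary_cross_entropy_with_logits(
                tube_logits, torch.sigmoid(vid_logits).detach()))
    return loss
```

In a full pipeline such a step would be alternated with the usual video-level weak-supervision loss; the actual proposal generation and refinement schedule are specified in the paper itself.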
Related papers
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding [35.73830796500975]
We present an end-to-end one-stage framework, termed the Spatio-Temporal Consistency-Aware Transformer (STCAT).
To generate the above template under sufficient video-text perception, an encoder-decoder architecture is proposed for effective global context modeling.
Our method outperforms previous state-of-the-art approaches by clear margins on two challenging video benchmarks.
arXiv Detail & Related papers (2022-09-27T11:13:04Z)
- Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively, producing a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms existing state-of-the-art methods in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Unpaired Adversarial Learning for Single Image Deraining with Rain-Space Contrastive Constraints [61.40893559933964]
We develop an effective unpaired SID method, named CDR-GAN, which explores the mutual properties of unpaired exemplars via contrastive learning in a GAN framework.
Our method performs favorably against existing unpaired deraining approaches on both synthetic and real-world datasets, and even outperforms several fully-supervised or semi-supervised models.
arXiv Detail & Related papers (2021-09-07T10:00:45Z)
- Joint Inductive and Transductive Learning for Video Object Segmentation [107.32760625159301]
Semi-supervised video object segmentation is the task of segmenting the target object in a video sequence given only its mask in the first frame.
Most previous best-performing methods adopt matching-based transductive reasoning or online inductive learning.
We propose to integrate transductive and inductive learning into a unified framework to exploit the complementarity between them for accurate and robust video object segmentation.
arXiv Detail & Related papers (2021-08-08T16:25:48Z)
- Self-Supervised Learning for Semi-Supervised Temporal Action Proposal [42.6254639252739]
We design an effective Self-supervised Semi-supervised Temporal Action Proposal (SSTAP) framework.
The SSTAP contains two crucial branches, i.e., a temporal-aware semi-supervised branch and a relation-aware self-supervised branch.
We extensively evaluate the proposed SSTAP on THUMOS14 and ActivityNet v1.3 datasets.
arXiv Detail & Related papers (2021-04-07T16:03:25Z)
- Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-based Person Re-identification [98.7585431239291]
Video-based person re-identification aims at matching the same person across video clips.
In this paper, we propose an attentive feature aggregation module, namely Multi-Granularity Reference-aided Attentive Feature Aggregation (MG-RAFA); a generic sketch of attention-based temporal aggregation is shown after this list.
Our framework achieves state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2020-03-27T03:49:21Z)
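As referenced in the MG-RAFA entry above, here is a generic sketch of attention-based temporal feature aggregation, assuming simple learned per-frame weights rather than that paper's reference-aided, multi-granularity design; the class name and dimensions are illustrative only.

```python
# Generic attentive temporal aggregation: pool per-frame features into one
# clip-level descriptor. A much simplified stand-in for MG-RAFA, shown for
# illustration only.
import torch
import torch.nn as nn

class AttentiveAggregation(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Scores each frame; softmax over time yields attention weights.
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))

    def forward(self, frame_feats):    # frame_feats: (batch, num_frames, dim)
        weights = torch.softmax(self.scorer(frame_feats), dim=1)  # (batch, T, 1)
        return (weights * frame_feats).sum(dim=1)                 # (batch, dim)

# Usage: 2 clips, 8 frames each, 256-d frame features -> (2, 256) clip vectors.
clip_vecs = AttentiveAggregation(256)(torch.randn(2, 8, 256))
```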