Cloze Test Helps: Effective Video Anomaly Detection via Learning to
Complete Video Events
- URL: http://arxiv.org/abs/2008.11988v1
- Date: Thu, 27 Aug 2020 08:32:51 GMT
- Title: Cloze Test Helps: Effective Video Anomaly Detection via Learning to
Complete Video Events
- Authors: Guang Yu, Siqi Wang, Zhiping Cai, En Zhu, Chuanfu Xu, Jianping Yin,
Marius Kloft
- Abstract summary: Video anomaly detection (VAD) has made fruitful progress via deep neural networks (DNNs).
Inspired by the cloze test frequently used in language study, we propose a brand-new VAD solution named Video Event Completion (VEC).
VEC consistently outperforms state-of-the-art methods by a notable margin (typically 1.5%-5% AUROC) on commonly-used VAD benchmarks.
- Score: 41.500063839748094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a vital topic in media content interpretation, video anomaly detection
(VAD) has made fruitful progress via deep neural networks (DNNs). However,
existing methods usually follow a reconstruction or frame prediction routine.
They suffer from two gaps: (1) They cannot localize video activities in a manner
that is both precise and comprehensive. (2) They lack sufficient ability to utilize
high-level semantics and temporal context information. Inspired by the cloze
test frequently used in language study, we propose a brand-new VAD solution
named Video Event Completion (VEC) to bridge the gaps above: First, we
propose a novel pipeline to achieve both precise and comprehensive enclosure of
video activities. Appearance and motion are exploited as mutually complementary
cues to localize regions of interest (RoIs). A normalized spatio-temporal cube
(STC) is built from each RoI as a video event, which lays the foundation of VEC
and serves as a basic processing unit. Second, we encourage a DNN to capture
high-level semantics by solving a visual cloze test. To build such a visual
cloze test, a certain patch of STC is erased to yield an incomplete event (IE).
The DNN learns to restore the original video event from the IE by inferring the
missing patch. Third, to incorporate richer motion dynamics, another DNN is
trained to infer erased patches' optical flow. Finally, two ensemble strategies
using different types of IE and modalities are proposed to boost VAD
performance, so as to fully exploit the temporal context and modality
information for VAD. VEC can consistently outperform state-of-the-art methods
by a notable margin (typically 1.5%-5% AUROC) on commonly-used VAD benchmarks.
Our code and results can be verified at github.com/yuguangnudt/VEC_VAD.
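To make the core idea concrete, below is a minimal, hypothetical Python sketch of a visual cloze test over one video event: an RoI is cropped from a few consecutive frames and stacked into a spatio-temporal cube (STC), one temporal patch is erased to form an incomplete event (IE), a small completion network predicts the missing patch, and the completion error serves as the anomaly score. The helper names (build_stc, CompletionNet, cloze_score), the 5x32x32 cube size, and the tiny network are illustrative assumptions, not the authors' actual implementation (which is available at the repository above).

```python
# Illustrative sketch of the visual cloze test idea (NOT the authors' code).
# Cube size, network shape, and helper names are assumptions for demonstration.
import numpy as np
import torch
import torch.nn as nn

T, H, W = 5, 32, 32  # assumed STC size: 5 temporal patches of 32x32 pixels

def build_stc(frames, box):
    """Crop the same RoI box from T consecutive grayscale frames and stack
    them into a normalized spatio-temporal cube of shape (T, H, W)."""
    x1, y1, x2, y2 = box
    patches = [f[y1:y2, x1:x2] for f in frames]
    # Assumes the box is already H x W; real code would resize each patch.
    stc = np.stack(patches).astype(np.float32) / 255.0
    return torch.from_numpy(stc)

class CompletionNet(nn.Module):
    """Tiny encoder-decoder that predicts the erased patch from the others."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(T - 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, incomplete_event):   # (B, T-1, H, W)
        return self.net(incomplete_event)  # (B, 1, H, W)

def cloze_score(stc, model, erase_idx=T // 2):
    """Erase one temporal patch, complete it, and return the completion error.
    A larger error means the event deviates more from learned normal patterns."""
    target = stc[erase_idx:erase_idx + 1]                   # (1, H, W)
    ie = torch.cat([stc[:erase_idx], stc[erase_idx + 1:]])  # (T-1, H, W)
    with torch.no_grad():
        pred = model(ie.unsqueeze(0))[0]                    # (1, H, W)
    return torch.mean((pred - target) ** 2).item()

if __name__ == "__main__":
    model = CompletionNet()  # in practice, trained on normal events only
    frames = [np.random.randint(0, 256, (240, 360), dtype=np.uint8)
              for _ in range(T)]
    stc = build_stc(frames, box=(100, 100, 132, 132))  # a 32x32 RoI
    err = cloze_score(stc, model)
    print(f"appearance completion error: {err:.4f}")
```

In the full method described in the abstract, an analogous network completes the erased patch's optical flow, and errors from different erased-patch positions and from both modalities are ensembled into the final anomaly score; the sketch above only covers the appearance stream for a single erased position.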
Related papers
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- Temporal Sentence Grounding in Streaming Videos [60.67022943824329]
This paper aims to tackle a novel task, Temporal Sentence Grounding in Streaming Videos (TSGSV).
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
arXiv Detail & Related papers (2023-08-14T12:30:58Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has been paid increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos through cross-modal queries.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- Video Event Restoration Based on Keyframes for Video Anomaly Detection [9.18057851239942]
Existing deep neural network based video anomaly detection (VAD) methods mostly follow the route of frame reconstruction or frame prediction.
We introduce a brand-new VAD paradigm to break through these limitations.
We propose a novel U-shaped Swin Transformer Network with Dual Skip Connections (USTN-DSC) for video event restoration.
arXiv Detail & Related papers (2023-04-11T10:13:19Z)
- Long-Short Temporal Co-Teaching for Weakly Supervised Video Anomaly Detection [14.721615285883423]
Weakly supervised video anomaly detection (WS-VAD) is a challenging problem that aims to learn VAD models only with video-level annotations.
Our proposed method is able to better deal with anomalies with varying durations as well as subtle anomalies.
arXiv Detail & Related papers (2023-03-31T13:28:06Z)
- Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization [8.530561069113716]
We propose a novel video-level semantic consistency guidance network for the AVE localization task.
It consists of two components: a cross-modal event representation extractor and an intra-modal semantic consistency enhancer.
We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-10-11T08:15:57Z)
- Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z)
- Video Abnormal Event Detection by Learning to Complete Visual Cloze Tests [50.1446994599891]
Video abnormal event detection (VAD) is a vital semi-supervised task that requires learning with only roughly labeled normal videos.
We propose a novel approach named visual cloze completion (VCC), which performs VAD by learning to complete "visual cloze tests" (VCTs).
We show that VCC achieves state-of-the-art VAD performance.
arXiv Detail & Related papers (2021-08-05T04:05:36Z)