Video Abnormal Event Detection by Learning to Complete Visual Cloze
Tests
- URL: http://arxiv.org/abs/2108.02356v1
- Date: Thu, 5 Aug 2021 04:05:36 GMT
- Title: Video Abnormal Event Detection by Learning to Complete Visual Cloze
Tests
- Authors: Siqi Wang, Guang Yu, Zhiping Cai, Xinwang Liu, En Zhu, Jianping Yin,
Qing Liao
- Abstract summary: Video abnormal event detection (VAD) is a vital semi-supervised task that requires learning with only roughly labeled normal videos.
We propose a novel approach named visual cloze completion (VCC), which performs VAD by learning to complete "visual cloze tests" (VCTs).
We show that VCC achieves state-of-the-art VAD performance.
- Score: 50.1446994599891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video abnormal event detection (VAD) is a vital semi-supervised task that
requires learning with only roughly labeled normal videos, as anomalies are
often practically unavailable. Although deep neural networks (DNNs) enable
great progress in VAD, existing solutions typically suffer from two issues: (1)
The precise and comprehensive localization of video events is ignored. (2) The
video semantics and temporal context are under-explored. To address those
issues, we are motivated by the prevalent cloze test in education and propose a
novel approach named visual cloze completion (VCC), which performs VAD by
learning to complete "visual cloze tests" (VCTs). Specifically, VCC first
localizes each video event and encloses it into a spatio-temporal cube (STC).
To achieve both precise and comprehensive localization, appearance and motion
are used as mutually complementary cues to mark the object region associated
with each video event. For each marked region, a normalized patch sequence is
extracted from temporally adjacent frames and stacked into the STC. By
comparing each patch and the patch sequence of an STC to a visual "word" and
"sentence" respectively, we can deliberately erase a certain "word" (patch) to
yield a VCT. DNNs are then trained to infer the erased patch by video
semantics, so as to complete the VCT. To fully exploit the temporal context,
each patch in the STC is erased in turn to create multiple VCTs, and the
erased patch's optical flow is also inferred to integrate richer motion clues.
Meanwhile, a new DNN architecture is designed as a model-level solution to
utilize video semantics and temporal context. Extensive experiments demonstrate
that VCC achieves state-of-the-art VAD performance. Our code and results are
available at \url{https://github.com/yuguangnudt/VEC_VAD/tree/VCC}.
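The cloze-style objective described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch illustration, not the authors' released code at the URL above: it builds one visual cloze test from a spatio-temporal cube (STC) by erasing a single patch and trains a toy completion network to infer it. The cube length, patch size, network architecture, and names such as `ClozeCompletionNet` and `make_vct` are assumptions for illustration, and the optical-flow branch is omitted.

```python
# Illustrative sketch of the visual cloze test (VCT) idea; NOT the authors'
# implementation (see https://github.com/yuguangnudt/VEC_VAD/tree/VCC).
import torch
import torch.nn as nn

T, P = 5, 32  # hypothetical STC length (number of patches) and patch size


class ClozeCompletionNet(nn.Module):
    """Toy encoder-decoder that infers one erased patch from the remaining T-1."""

    def __init__(self, t=T):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * (t - 1), 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),  # predicted RGB patch (a flow head would output 2 channels)
        )

    def forward(self, context):  # context: (B, 3*(T-1), P, P)
        return self.net(context)


def make_vct(stc, erase_idx):
    """Build one visual cloze test: erase patch `erase_idx` from an STC.

    stc: (B, T, 3, P, P) spatio-temporal cube of normalized patches.
    Returns the remaining patches (network input) and the erased patch (target).
    """
    keep = [t for t in range(stc.size(1)) if t != erase_idx]
    context = stc[:, keep].flatten(1, 2)  # (B, 3*(T-1), P, P)
    target = stc[:, erase_idx]            # (B, 3, P, P)
    return context, target


if __name__ == "__main__":
    model = ClozeCompletionNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    stc = torch.rand(8, T, 3, P, P)  # stand-in for STCs extracted from normal videos
    # Erase every temporal position in turn so each patch yields one VCT.
    for erase_idx in range(T):
        context, target = make_vct(stc, erase_idx)
        loss = nn.functional.mse_loss(model(context), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # At test time, a large completion error on any VCT of an event would flag it as abnormal.
```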
Related papers
- Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has been paid increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos by cross-modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - Long-Short Temporal Co-Teaching for Weakly Supervised Video Anomaly Detection [14.721615285883423]
Weakly supervised video anomaly detection (WS-VAD) is a challenging problem that aims to learn VAD models with only video-level annotations.
Our proposed method is able to better deal with anomalies with varying durations as well as subtle anomalies.
arXiv Detail & Related papers (2023-03-31T13:28:06Z) - Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN) for video captioning.
Our method reaches state-of-the-art performance, indicating its effectiveness.
arXiv Detail & Related papers (2022-11-17T11:27:15Z) - Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to unsupervisedly pre-train feature encoders for temporal action localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
arXiv Detail & Related papers (2022-03-25T12:13:43Z) - End-to-End Dense Video Captioning with Parallel Decoding [53.34238344647624]
We propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC).
PDVC precisely segments the video into a number of event pieces under the holistic understanding of the video content.
Experiments on ActivityNet Captions and YouCook2 show that PDVC is capable of producing high-quality captioning results.
arXiv Detail & Related papers (2021-08-17T17:39:15Z) - Cloze Test Helps: Effective Video Anomaly Detection via Learning to Complete Video Events [41.500063839748094]
Video anomaly detection (VAD) has made fruitful progress via deep neural networks (DNNs).
Inspired by the frequently-used cloze test in language study, we propose a brand-new VAD solution named Video Event Completion (VEC).
VEC consistently outperforms state-of-the-art methods by a notable margin (typically 1.5%-5% AUROC) on commonly-used VAD benchmarks.
arXiv Detail & Related papers (2020-08-27T08:32:51Z)