Multi-Contextual Predictions with Vision Transformer for Video Anomaly
Detection
- URL: http://arxiv.org/abs/2206.08568v1
- Date: Fri, 17 Jun 2022 05:54:31 GMT
- Title: Multi-Contextual Predictions with Vision Transformer for Video Anomaly
Detection
- Authors: Joo-Yeon Lee, Woo-Jeoung Nam, Seong-Whan Lee
- Abstract summary: Understanding the spatio-temporal context of a video plays a vital role in anomaly detection.
We design a transformer model with three different contextual prediction streams: masked, whole and partial.
By learning to predict the missing frames of consecutive normal frames, our model can effectively learn various normality patterns in the video.
- Score: 22.098399083491937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Anomaly Detection (VAD) has traditionally been tackled with two main
methodologies: the reconstruction-based approach and the prediction-based one.
Because reconstruction-based methods learn to reproduce the input image, the
model may merely learn an identity function, a problem known as the
generalizing issue. On the other hand, since prediction-based methods learn to
predict a future frame given several previous frames, they are less sensitive
to the generalizing issue. However, it is still uncertain if the model can
learn the spatio-temporal context of a video. Our intuition is that the
understanding of the spatio-temporal context of a video plays a vital role in
VAD as it provides precise information on how the appearance of an event in a
video clip changes. Hence, to fully exploit the context information for anomaly
detection in videos, we design a transformer model with three different
contextual prediction streams: masked, whole, and partial. By learning
to predict the missing frames of consecutive normal frames, our model can
effectively learn various normality patterns in the video, which leads to a
high reconstruction error on abnormal cases that do not fit the learned
context. To verify the effectiveness of our approach, we evaluate our model
on the public benchmark datasets UCSD Pedestrian 2, CUHK Avenue, and
ShanghaiTech, measuring performance with an anomaly score based on
reconstruction error. The results demonstrate that our proposed approach
achieves a competitive performance compared to the existing video anomaly
detection methods.
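As a concrete illustration of the prediction-based scheme, the following is a minimal sketch of one masked prediction stream with a transformer and an error-based anomaly score. It is not the authors' implementation: the module names, dimensions, and the single stream shown (of the paper's three) are illustrative assumptions.

```python
# Minimal masked-frame prediction sketch (NOT the authors' code).
import torch
import torch.nn as nn

class MaskedFramePredictor(nn.Module):
    """Predicts one held-out frame from a short clip of context frames."""
    def __init__(self, patch=16, dim=256, depth=4, heads=8, ctx_frames=4, size=64):
        super().__init__()
        self.patch = patch
        self.n_patches = (size // patch) ** 2
        self.embed = nn.Linear(3 * patch * patch, dim)        # patch -> token
        self.mask_tokens = nn.Parameter(torch.zeros(1, self.n_patches, dim))
        self.pos = nn.Parameter(
            torch.zeros(1, (ctx_frames + 1) * self.n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 3 * patch * patch)         # token -> patch

    def patchify(self, x):                                    # (B, T, 3, H, W)
        B, T, C, H, W = x.shape
        p = self.patch
        x = x.reshape(B, T, C, H // p, p, W // p, p)
        x = x.permute(0, 1, 3, 5, 2, 4, 6)
        return x.reshape(B, T * (H // p) * (W // p), C * p * p)

    def forward(self, context):                               # context frames only
        B = context.shape[0]
        tokens = torch.cat(
            [self.embed(self.patchify(context)),              # visible tokens
             self.mask_tokens.expand(B, -1, -1)], dim=1) + self.pos
        out = self.encoder(tokens)
        return self.head(out[:, -self.n_patches:])            # missing frame

def anomaly_score(model, context, target):
    """Mean squared prediction error on the held-out frame; higher = more anomalous."""
    with torch.no_grad():
        pred = model(context)                                 # (B, N, 3*p*p)
        gt = model.patchify(target.unsqueeze(1))              # (B, N, 3*p*p)
        return ((pred - gt) ** 2).mean(dim=(1, 2))
```

Training would minimize the same prediction error on normal clips only, so frames whose appearance or motion breaks the learned context yield high scores at test time.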
Related papers
- Let Video Teaches You More: Video-to-Image Knowledge Distillation using DEtection TRansformer for Medical Video Lesion Detection [91.97935118185]
We propose Video-to-Image knowledge distillation for the task of medical video lesion detection.
By distilling multi-frame contexts into a single frame, the proposed V2I-DETR combines the advantages of utilizing temporal contexts from video-based models and the inference speed of image-based models.
V2I-DETR outperforms previous state-of-the-art methods by a large margin while achieving real-time inference speed (30 FPS), on par with image-based models.
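For illustration, a generic feature-distillation objective in the spirit of distilling multi-frame context into a single-frame model might look as follows. This is a hedged sketch, not V2I-DETR's actual loss; the feature shapes and temperature are assumptions.

```python
# Generic teacher->student feature distillation (illustrative assumption).
import torch.nn.functional as F

def video_to_image_distill_loss(student_feat, teacher_feat, temperature=2.0):
    """student_feat: image-model features for the key frame, (B, N, D).
    teacher_feat: video-model features aggregated over the clip, (B, N, D)."""
    s = F.log_softmax(student_feat / temperature, dim=-1)
    t = F.softmax(teacher_feat.detach() / temperature, dim=-1)  # frozen teacher
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```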
arXiv Detail & Related papers (2024-08-26T07:17:05Z) - Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM [35.06386971859359]
Holmes-VAD is a novel framework that leverages precise temporal supervision and rich multimodal instructions.
We construct the first large-scale multimodal VAD instruction-tuning benchmark, VAD-Instruct50k.
Building upon the VAD-Instruct50k dataset, we develop a customized solution for interpretable video anomaly detection.
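As a rough illustration of what such temporally supervised multimodal instruction data could look like, here is a hypothetical record. The actual VAD-Instruct50k schema is not given here, so every field name and value below is an assumption.

```python
# Hypothetical instruction-tuning record (assumed schema, not VAD-Instruct50k).
sample = {
    "video": "clip_00123.mp4",
    "instruction": "Describe any anomalous event in this clip and when it occurs.",
    "response": "A person climbs over the fence between seconds 4 and 7.",
    "anomaly_span": [4.0, 7.0],   # assumed precise temporal supervision, seconds
}
```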
arXiv Detail & Related papers (2024-06-18T03:19:24Z) - VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos.
Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models.
We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmark and find that most models have difficulty identifying the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z) - Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection [19.643936110623653]
Video Anomaly Detection (VAD) aims to identify abnormalities within a specific context and timeframe.
Recent deep learning-based VAD models have shown promising results by generating high-resolution frames.
We propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task.
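A toy version of an inter-patch relationship prediction objective, purely illustrative: classify the relative position of one patch with respect to another. The paper's exact task, labels, and encoder may differ.

```python
# Toy inter-patch relation classification (illustrative assumption).
import torch
import torch.nn as nn

class PatchRelationHead(nn.Module):
    def __init__(self, feat_dim=256, n_relations=9):   # e.g. 8 neighbors + "same"
        super().__init__()
        self.classify = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, n_relations))

    def forward(self, anchor, other):                  # features from any patch encoder
        return self.classify(torch.cat([anchor, other], dim=-1))

head = PatchRelationHead()
anchor, other = torch.randn(8, 256), torch.randn(8, 256)
labels = torch.randint(0, 9, (8,))                     # relative-position targets
loss = nn.CrossEntropyLoss()(head(anchor, other), labels)
```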
arXiv Detail & Related papers (2024-03-28T03:07:16Z) - Future Video Prediction from a Single Frame for Video Anomaly Detection [0.38073142980732994]
Video anomaly detection (VAD) is an important but challenging task in computer vision.
We introduce future frame prediction from a single frame as a novel proxy-task for video anomaly detection.
This proxy-task alleviates the challenges of previous methods in learning longer motion patterns.
arXiv Detail & Related papers (2023-08-15T14:04:50Z) - Making Reconstruction-based Method Great Again for Video Anomaly
Detection [64.19326819088563]
Anomaly detection in videos is a significant yet challenging problem.
Existing reconstruction-based methods rely on old-fashioned convolutional autoencoders.
We propose a new autoencoder model for enhanced consecutive frame reconstruction.
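For reference, a minimal convolutional autoencoder over a stack of consecutive frames, the kind of baseline such work builds on. Stacking frames along channels and all layer sizes are assumptions, not the paper's architecture.

```python
# Baseline convolutional autoencoder over stacked frames (assumed design).
import torch
import torch.nn as nn

class ClipAutoencoder(nn.Module):
    def __init__(self, frames=4):
        super().__init__()
        c = frames * 3                                  # frames stacked on channels
        self.enc = nn.Sequential(
            nn.Conv2d(c, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, c, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):                               # x: (B, frames*3, H, W)
        return self.dec(self.enc(x))

clip = torch.rand(2, 12, 64, 64)
recon_error = ((ClipAutoencoder()(clip) - clip) ** 2).mean()  # anomaly cue
```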
arXiv Detail & Related papers (2023-01-28T01:57:57Z) - Convolutional Transformer based Dual Discriminator Generative
Adversarial Networks for Video Anomaly Detection [27.433162897608543]
We propose Convolutional Transformer based Dual Discriminator Generative Adversarial Networks (CT-D2GAN) to perform unsupervised video anomaly detection.
It contains three key components, i.e., a convolutional encoder to capture the spatial information of input clips, a temporal self-attention module to encode the temporal dynamics, and a module to predict the future frame.
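A sketch of the described pipeline: a convolutional encoder produces per-frame spatial features, and temporal self-attention over the frame embeddings yields a future-frame representation. Shapes and modules are assumptions, not the CT-D2GAN code.

```python
# Conv encoder + temporal self-attention pipeline (assumed shapes/modules).
import torch
import torch.nn as nn

conv_encoder = nn.Sequential(                  # spatial features per frame
    nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())     # -> (B*T, 64)

temporal_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

clip = torch.rand(2, 4, 3, 64, 64)             # (B, T, C, H, W)
B, T = clip.shape[:2]
feats = conv_encoder(clip.flatten(0, 1)).view(B, T, 64)
ctx, _ = temporal_attn(feats, feats, feats)    # encode temporal dynamics
future_feat = ctx[:, -1]                       # decoded to the future frame in the paper
```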
arXiv Detail & Related papers (2021-07-29T03:07:25Z) - Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of the query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z) - Robust Unsupervised Video Anomaly Detection by Multi-Path Frame
Prediction [61.17654438176999]
We propose a novel and robust unsupervised video anomaly detection method based on multi-path frame prediction.
Our proposed method obtains the frame-level AUROC score of 88.3% on the CUHK Avenue dataset.
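For context, a frame-level AUROC such as the 88.3% above is typically computed by ranking per-frame anomaly scores against binary ground-truth labels, e.g. with scikit-learn (random placeholders below):

```python
# Standard frame-level AUROC computation (placeholder data).
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.random.randint(0, 2, size=1000)    # 1 = anomalous frame
scores = np.random.rand(1000)                  # model's per-frame anomaly scores
print("frame-level AUROC:", roc_auc_score(labels, scores))
```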
arXiv Detail & Related papers (2020-11-05T11:34:12Z) - Consistency Guided Scene Flow Estimation [159.24395181068218]
CGSF is a self-supervised framework for the joint reconstruction of 3D scene structure and motion from stereo video.
We show that the proposed model can reliably predict disparity and scene flow in challenging imagery.
It achieves better generalization than the state-of-the-art, and adapts quickly and robustly to unseen domains.
arXiv Detail & Related papers (2020-06-19T17:28:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.