Delving into CLIP latent space for Video Anomaly Recognition
- URL: http://arxiv.org/abs/2310.02835v1
- Date: Wed, 4 Oct 2023 14:01:55 GMT
- Title: Delving into CLIP latent space for Video Anomaly Recognition
- Authors: Luca Zanella, Benedetta Liberatori, Willi Menapace, Fabio Poiesi,
Yiming Wang, Elisa Ricci
- Abstract summary: We introduce the novel method AnomalyCLIP, the first to combine Large Language and Vision (LLV) models, such as CLIP, with multiple instance learning for joint video anomaly detection and classification.
Our approach specifically involves manipulating the latent CLIP feature space to identify the normal event subspace.
When anomalous frames are projected onto these directions, they exhibit a large feature magnitude if they belong to a particular class.
- Score: 24.37974279994544
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We tackle the complex problem of detecting and recognising anomalies in
surveillance videos at the frame level, utilising only video-level supervision.
We introduce the novel method AnomalyCLIP, the first to combine Large Language
and Vision (LLV) models, such as CLIP, with multiple instance learning for
joint video anomaly detection and classification. Our approach specifically
involves manipulating the latent CLIP feature space to identify the normal
event subspace, which in turn allows us to effectively learn text-driven
directions for abnormal events. When anomalous frames are projected onto these
directions, they exhibit a large feature magnitude if they belong to a
particular class. We also introduce a computationally efficient Transformer
architecture to model short- and long-term temporal dependencies between
frames, ultimately producing the final anomaly score and class prediction
probabilities. We compare AnomalyCLIP against state-of-the-art methods
considering three major anomaly detection benchmarks, i.e. ShanghaiTech,
UCF-Crime, and XD-Violence, and empirically show that it outperforms baselines
in recognising video anomalies.
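The projection idea described in the abstract can be illustrated with a short sketch. The following is a minimal illustration, not the authors' implementation: it assumes CLIP-style frame and text embeddings (random placeholders here), models the normal-event subspace with a few principal directions via torch.pca_lowrank, and measures the magnitude of each frame's residual along text-driven class directions.

```python
# Minimal sketch of the idea above: remove the normal-event component from
# CLIP frame features, then measure magnitude along text-driven anomaly
# directions. All tensors are random placeholders, not the authors' code.
import torch
import torch.nn.functional as F

d, n_frames, n_classes = 512, 16, 5

# Hypothetical CLIP embeddings (in practice, from CLIP image/text encoders).
frame_feats = F.normalize(torch.randn(n_frames, d), dim=-1)
text_feats = F.normalize(torch.randn(n_classes, d), dim=-1)  # one prompt per anomaly class

# Model the normal-event subspace with its top-k principal directions,
# estimated here from placeholder normal training features.
normal_feats = F.normalize(torch.randn(1000, d), dim=-1)
centre = normal_feats.mean(dim=0)
_, _, V = torch.pca_lowrank(normal_feats - centre, q=10)  # (d, 10) basis

# Remove the component of each frame that lies in the normal subspace.
centred = frame_feats - centre
residual = centred - (centred @ V) @ V.T

# Project residuals onto text-driven class directions: a large magnitude
# along a direction suggests the frame is anomalous for that class.
class_scores = residual @ text_feats.T           # (n_frames, n_classes)
anomaly_score = class_scores.max(dim=-1).values  # per-frame anomaly score
predicted_class = class_scores.argmax(dim=-1)
print(anomaly_score.shape, predicted_class.shape)
```

In the actual method, such scores would feed the temporal Transformer and the multiple-instance-learning objective described in the abstract.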
Related papers
- Temporal Divide-and-Conquer Anomaly Actions Localization in Semi-Supervised Videos with Hierarchical Transformer [0.9208007322096532]
Anomaly action detection and localization play an essential role in security and advanced surveillance systems.
We propose a hierarchical transformer model designed to evaluate the significance of observed actions in anomalous videos.
Our approach segments a parent video hierarchically into multiple temporal children instances and measures the influence of the children nodes in classifying the abnormality of the parent video.
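As a rough illustration of the parent/children decomposition described here, the sketch below recursively halves a video's frame features into children segments and scores each with a placeholder classifier; shapes and the scorer are assumptions, not the paper's model.

```python
# Illustrative sketch (not the paper's code): split a parent video's frame
# features hierarchically into children segments and score each child with
# a placeholder abnormality scorer.
import torch

def hierarchical_segments(feats, depth):
    """Yield (level, segment) pairs by recursively halving the sequence."""
    segments = [feats]
    for level in range(depth):
        yield from ((level, s) for s in segments)
        segments = [half for s in segments for half in torch.chunk(s, 2, dim=0)]

frames = torch.randn(64, 512)             # hypothetical per-frame features
score = lambda seg: seg.mean().sigmoid()  # placeholder abnormality scorer

for level, seg in hierarchical_segments(frames, depth=3):
    print(f"level {level}: {tuple(seg.shape)} -> abnormality {score(seg).item():.3f}")
```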
arXiv Detail & Related papers (2024-08-24T18:12:58Z)
- Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
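The general mechanism of learnable prompt embeddings for a frozen VLM text encoder can be sketched as follows; this is a generic CoOp-style illustration with placeholder names and sizes, not the paper's spatio-temporal prompt design.

```python
# Hedged sketch: learnable context vectors prepended to per-class token
# embeddings, as commonly done when prompting a frozen VLM text encoder.
import torch
import torch.nn as nn

class PromptedClassEmbedding(nn.Module):
    def __init__(self, n_ctx=8, dim=512, n_classes=5):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)                 # learnable prompt tokens
        self.cls_tokens = nn.Parameter(torch.randn(n_classes, 4, dim) * 0.02)   # stand-in class-name tokens

    def forward(self):
        # Prepend the shared learnable context to each class's token sequence.
        ctx = self.ctx.unsqueeze(0).expand(self.cls_tokens.size(0), -1, -1)
        return torch.cat([ctx, self.cls_tokens], dim=1)  # (n_classes, n_ctx + 4, dim)

prompts = PromptedClassEmbedding()()
print(prompts.shape)  # would be fed to a frozen text encoder in practice
```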
arXiv Detail & Related papers (2024-08-12T03:31:29Z)
- Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection [19.643936110623653]
Video Anomaly Detection (VAD) aims to identify abnormalities within a specific context and timeframe.
Recent deep learning-based VAD models have shown promising results by generating high-resolution frames.
We propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task.
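A generic pretext task in this spirit, predicting the relation between two patches, might look like the sketch below; the encoder, shapes, and the 8-way relation space are assumptions for illustration, not the paper's architecture.

```python
# Illustrative self-supervised pretext task: given two patches, predict
# their relative spatial position. A generic sketch, not the paper's task.
import torch
import torch.nn as nn

patch_a = torch.randn(8, 3, 32, 32)   # batch of anchor patches
patch_b = torch.randn(8, 3, 32, 32)   # batch of neighbour patches
relation = torch.randint(0, 8, (8,))  # 8 possible relative positions

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
head = nn.Linear(256, 8)              # classify the relative position

logits = head(torch.cat([encoder(patch_a), encoder(patch_b)], dim=1))
loss = nn.functional.cross_entropy(logits, relation)
loss.backward()  # at test time, poor relation prediction signals an anomaly
print(loss.item())
```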
arXiv Detail & Related papers (2024-03-28T03:07:16Z)
- Dynamic Erasing Network Based on Multi-Scale Temporal Features for Weakly Supervised Video Anomaly Detection [103.92970668001277]
We propose a Dynamic Erasing Network (DE-Net) for weakly supervised video anomaly detection.
We first propose a multi-scale temporal modeling module, capable of extracting features from segments of varying lengths.
Then, we design a dynamic erasing strategy, which dynamically assesses the completeness of the detected anomalies.
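A hedged sketch of these two ideas, with a placeholder scorer and random features: scores are pooled at several temporal scales, then the highest-scoring snippet is "erased" and the video re-scored to probe for remaining anomalous segments.

```python
# Sketch only: multi-scale temporal pooling plus a simple erasing step.
# The scorer and features are placeholders, not the DE-Net implementation.
import torch

snippets = torch.randn(32, 512)  # hypothetical per-snippet video features
scorer = torch.nn.Linear(512, 1)

def multi_scale_scores(x, scales=(1, 2, 4)):
    scores = []
    for s in scales:
        # Average-pool non-overlapping windows of length s, then score.
        pooled = x[: len(x) // s * s].view(-1, s, x.size(-1)).mean(dim=1)
        scores.append(scorer(pooled).squeeze(-1).repeat_interleave(s))
    return torch.stack(scores).mean(dim=0)  # fuse scales per snippet

scores = multi_scale_scores(snippets)
top = scores.argmax()
erased = snippets.clone()
erased[top] = 0  # erase the most confident anomaly and look again
print(scores[top].item(), multi_scale_scores(erased)[top].item())
```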
arXiv Detail & Related papers (2023-12-04T09:40:11Z)
- Open-Vocabulary Video Anomaly Detection [57.552523669351636]
Video anomaly detection (VAD) with weak supervision has achieved remarkable performance in utilizing video-level labels to discriminate whether a video frame is normal or abnormal.
Recent studies attempt to tackle a more realistic setting, open-set VAD, which aims to detect unseen anomalies given seen anomalies and normal videos.
This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD), in which we aim to leverage pre-trained large models to detect and categorize seen and unseen anomalies.
arXiv Detail & Related papers (2023-11-13T02:54:17Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos via cross-modal queries.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
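The cross-modal retrieval setting can be sketched in a few lines; the embeddings below are random placeholders standing in for a trained model's output, and the query string is hypothetical.

```python
# Minimal sketch of cross-modal anomaly retrieval: rank gallery videos by
# cosine similarity to a text query embedding. Placeholder embeddings only.
import torch
import torch.nn.functional as F

video_embs = F.normalize(torch.randn(100, 512), dim=-1)  # gallery of videos
query_emb = F.normalize(torch.randn(512), dim=-1)        # e.g. "a person shoplifting"

similarity = video_embs @ query_emb  # cosine similarity per video
top5 = similarity.topk(5).indices    # indices of best-matching videos
print(top5.tolist())
```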
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection [3.146076597280736]
Video anomaly detection (VAD) is a challenging problem in video surveillance, where anomalous frames need to be localized in an untrimmed video.
We first propose to utilize the ViT-encoded visual features from CLIP, in contrast with the conventional C3D or I3D features in the domain, to efficiently extract discriminative representations in the novel technique.
Our proposed CLIP-TSA outperforms the existing state-of-the-art (SOTA) methods by a large margin on three commonly-used benchmark datasets in the VAD problem.
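Temporal self-attention over per-frame features, the general mechanism named in the title, can be sketched with PyTorch's built-in attention module; this is not the paper's exact TSA module, and the feature tensor is a placeholder.

```python
# Hedged sketch: self-attention over a sequence of per-frame CLIP features,
# followed by a per-frame anomaly score head. Placeholder data throughout.
import torch
import torch.nn as nn

frames = torch.randn(1, 64, 512)  # (batch, time, dim): hypothetical CLIP features

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
attended, weights = attn(frames, frames, frames)  # each frame attends over time

score_head = nn.Linear(512, 1)
frame_scores = score_head(attended).squeeze(-1).sigmoid()  # per-frame anomaly scores
print(frame_scores.shape)  # under MIL training, the video label supervises the top scores
```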
arXiv Detail & Related papers (2022-12-09T22:28:24Z)
- Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection [27.433162897608543]
We propose Convolutional Transformer based Dual Discriminator Generative Adversarial Networks (CT-D2GAN) to perform unsupervised video anomaly detection.
It contains three key components, i.e., a convolutional encoder to capture the spatial information of input clips, a temporal self-attention module to encode the temporal dynamics, and a convolutional decoder to predict the future frame.
arXiv Detail & Related papers (2021-07-29T03:07:25Z)
- Robust Unsupervised Video Anomaly Detection by Multi-Path Frame Prediction [61.17654438176999]
We propose a novel and robust unsupervised video anomaly detection method by frame prediction with a proper design.
Our proposed method obtains the frame-level AUROC score of 88.3% on the CUHK Avenue dataset.
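Frame-level AUROC, the metric quoted here, is computed from per-frame anomaly scores (typically prediction errors in frame-prediction methods) and binary frame labels; a small synthetic example follows, with made-up scores purely for illustration.

```python
# Sketch of the standard frame-level AUROC evaluation: anomaly scores
# (synthetic here) against binary frame labels. Not the paper's results.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)               # 1 = anomalous frame
# Higher prediction error on anomalous frames (synthetic illustration).
scores = rng.normal(0, 1, size=1000) + 1.5 * labels  # e.g. frame prediction error

print(f"frame-level AUROC: {roc_auc_score(labels, scores):.3f}")
```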
arXiv Detail & Related papers (2020-11-05T11:34:12Z)
- A Background-Agnostic Framework with Adversarial Training for Abnormal Event Detection in Video [120.18562044084678]
Abnormal event detection in video is a complex computer vision problem that has attracted significant attention in recent years.
We propose a background-agnostic framework that learns from training videos containing only normal events.
arXiv Detail & Related papers (2020-08-27T18:39:24Z)