No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection
- URL: http://arxiv.org/abs/2602.19248v1
- Date: Sun, 22 Feb 2026 16:03:43 GMT
- Title: No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection
- Authors: Zunkai Dai, Ke Li, Jiajia Liu, Jie Yang, Yuanyuan Qiao
- Abstract summary: Existing video anomaly detection methods underperform in open-world scenarios. Key contributing factors include limited dataset diversity and an inadequate understanding of context-dependent anomalous semantics. We propose LAVIDA, an end-to-end zero-shot video anomaly detection framework.
- Score: 15.949619310702579
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The collection and detection of video anomaly data have long been a challenging problem due to the rare occurrence and spatio-temporal scarcity of anomalies. Existing video anomaly detection (VAD) methods underperform in open-world scenarios. Key contributing factors include limited dataset diversity and an inadequate understanding of context-dependent anomalous semantics. To address these issues, i) we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework. ii) LAVIDA employs an Anomaly Exposure Sampler that transforms segmented objects into pseudo-anomalies to enhance model adaptability to unseen anomaly categories. It further integrates a Multimodal Large Language Model (MLLM) to bolster semantic comprehension capabilities. Additionally, iii) we design a token compression approach based on reverse attention to handle the spatio-temporal scarcity of anomalous patterns and decrease computational cost. The training process is conducted solely on pseudo-anomalies without any VAD data. Evaluations across four benchmark VAD datasets demonstrate that LAVIDA achieves SOTA performance in both frame-level and pixel-level anomaly detection under the zero-shot setting. Our code is available at https://github.com/VitaminCreed/LAVIDA.
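The abstract describes the Anomaly Exposure Sampler only at a high level: segmented objects are composited into normal frames to produce pseudo-anomalies for training. A minimal sketch of that object-pasting idea is below; the function name, compositing details, and mask handling are our own assumptions for illustration, not taken from the paper's implementation.

```python
import numpy as np

def paste_pseudo_anomaly(frame, obj_patch, obj_mask, top_left):
    """Composite a segmented object patch onto a normal frame.

    frame:    (H, W, 3) normal video frame
    obj_patch:(h, w, 3) pixels of a segmented object
    obj_mask: (h, w) boolean segmentation mask of the object
    top_left: (y, x) paste location inside the frame

    Returns the pseudo-anomalous frame and a pixel-level anomaly
    mask that can serve as supervision without any real VAD data.
    """
    out = frame.copy()
    anomaly_mask = np.zeros(frame.shape[:2], dtype=bool)
    y, x = top_left
    h, w = obj_mask.shape
    # Copy only the object's own pixels into the target region.
    region = out[y:y + h, x:x + w]
    region[obj_mask] = obj_patch[obj_mask]
    # The pasted footprint is, by construction, the anomaly ground truth.
    anomaly_mask[y:y + h, x:x + w] = obj_mask
    return out, anomaly_mask
```

Because the pasted object's footprint is known exactly, both frame-level labels (any True pixel) and pixel-level masks come for free, which matches the paper's claim of training solely on pseudo-anomalies.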
Related papers
- Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection [52.5174167737992]
Video anomaly detection (VAD) aims to identify abnormal events in videos.
We propose SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations.
Our method achieves state-of-the-art performance among tuning-free approaches requiring only 1% of training data.
arXiv Detail & Related papers (2026-02-27T13:48:50Z) - Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline [63.96226274616927]
A new framework called Track Any Anomalous Object (TAO) introduces a granular video anomaly detection pipeline.
Unlike methods that assign anomaly scores to every pixel, our approach transforms the problem into pixel-level tracking of anomalous objects.
Experiments demonstrate that TAO sets new benchmarks in accuracy and robustness.
arXiv Detail & Related papers (2025-06-05T15:49:39Z) - Language-guided Open-world Video Anomaly Detection under Weak Supervision [27.912128185225054]
Video anomaly detection (VAD) aims to detect anomalies that deviate from what is expected.
Existing methods assume that the definition of anomalies is invariable, and thus are not applicable to the open world.
We propose a novel open-world VAD paradigm with variable definitions, allowing guided detection through user-provided natural language at inference time.
arXiv Detail & Related papers (2025-03-17T13:31:19Z) - Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z) - VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos.
Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models.
We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmarking task and find that most of the models encounter difficulties in effectively identifying the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z) - Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection [16.77262005540559]
A novel framework is proposed to guide the learning of suspected anomalies from event prompts.
It enables a new multi-prompt learning process to constrain the visual-semantic features across all videos.
Our proposed model outperforms most state-of-the-art methods in terms of AP or AUC.
arXiv Detail & Related papers (2024-03-02T10:42:47Z) - Dynamic Erasing Network Based on Multi-Scale Temporal Features for Weakly Supervised Video Anomaly Detection [103.92970668001277]
We propose a Dynamic Erasing Network (DE-Net) for weakly supervised video anomaly detection.
We first propose a multi-scale temporal modeling module, capable of extracting features from segments of varying lengths.
Then, we design a dynamic erasing strategy, which dynamically assesses the completeness of the detected anomalies.
arXiv Detail & Related papers (2023-12-04T09:40:11Z) - Open-Vocabulary Video Anomaly Detection [57.552523669351636]
Video anomaly detection (VAD) with weak supervision has achieved remarkable performance in utilizing video-level labels to discriminate whether a video frame is normal or abnormal.
Recent studies attempt to tackle a more realistic setting, open-set VAD, which aims to detect unseen anomalies given seen anomalies and normal videos.
This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD), in which we aim to leverage pre-trained large models to detect and categorize seen and unseen anomalies.
arXiv Detail & Related papers (2023-11-13T02:54:17Z) - Beyond the Benchmark: Detecting Diverse Anomalies in Videos [0.6993026261767287]
Video Anomaly Detection (VAD) plays a crucial role in modern surveillance systems, aiming to identify various anomalies in real-world situations.
Current benchmark datasets predominantly emphasize simple, single-frame anomalies such as novel object detection.
We advocate for an expansion of VAD investigations to encompass intricate anomalies that extend beyond conventional benchmark boundaries.
arXiv Detail & Related papers (2023-10-03T09:22:06Z) - Anomaly Detection in Video Sequences: A Benchmark and Computational Model [25.25968958782081]
We contribute a new Large-scale Anomaly Detection (LAD) database as the benchmark for anomaly detection in video sequences.
It contains 2000 video sequences, including normal and abnormal clips spanning 14 anomaly categories such as crash, fire, and violence.
It provides the annotation data, including video-level labels (abnormal/normal video, anomaly type) and frame-level labels (abnormal/normal video frame) to facilitate anomaly detection.
We propose a multi-task deep neural network to solve anomaly detection as a fully-supervised learning problem.
arXiv Detail & Related papers (2021-06-16T06:34:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.