Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models
- URL: http://arxiv.org/abs/2510.16290v1
- Date: Sat, 18 Oct 2025 01:27:23 GMT
- Title: Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models
- Authors: Yue Zheng, Xiufang Shi, Jiming Chen, Yuanchao Shu
- Abstract summary: Cerberus is a two-stage cascaded system designed for efficient yet accurate real-time VAD. It learns normal behavioral rules offline, and combines lightweight filtering with fine-grained VLM reasoning during online inference. Cerberus on average achieves 57.68 fps on an NVIDIA L40S GPU, a 151.79$\times$ speedup, and 97.2% accuracy, comparable to state-of-the-art VLM-based VAD methods.
- Score: 20.102770709407437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video anomaly detection (VAD) has advanced rapidly with the recent development of Vision-Language Models (VLMs). While these models offer superior zero-shot detection capabilities, their immense computational cost and unstable visual grounding performance hinder real-time deployment. To overcome these challenges, we introduce Cerberus, a two-stage cascaded system designed for efficient yet accurate real-time VAD. Cerberus learns normal behavioral rules offline, and combines lightweight filtering with fine-grained VLM reasoning during online inference. The performance gains of Cerberus come from two key innovations: motion mask prompting and rule-based deviation detection. The former directs the VLM's attention to regions relevant to motion, while the latter identifies anomalies as deviations from learned norms rather than enumerating possible anomalies. Extensive evaluations on four datasets show that Cerberus on average achieves 57.68 fps on an NVIDIA L40S GPU, a 151.79$\times$ speedup, and 97.2\% accuracy, comparable to state-of-the-art VLM-based VAD methods, establishing it as a practical solution for real-time video analytics.
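The two-stage cascade described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the motion mask here is plain frame differencing, and the expensive VLM stage is replaced by a stand-in check against hypothetical "normal intensity" rules, since the abstract does not specify the actual model or the form of the learned rules. All function names and thresholds are assumptions for illustration.

```python
import numpy as np

def motion_mask(prev_frame, frame, threshold=25):
    """Binary mask of pixels whose intensity changed beyond a threshold
    (a crude stand-in for Cerberus's motion mask prompting)."""
    diff = np.abs(frame.astype(int) - prev_frame.astype(int))
    return diff > threshold

def lightweight_filter(mask, min_active=50):
    """Stage 1: a cheap gate -- only frames with enough motion proceed
    to the expensive reasoning stage."""
    return mask.sum() >= min_active

def vlm_reasoning(frame, mask, rules):
    """Stage 2 placeholder: in Cerberus this is a VLM prompted with the
    motion-masked region and checked against learned normal-behavior
    rules; here we fake it as a lookup of the mean masked intensity
    against (low, high) 'normal' ranges. A frame is anomalous if it
    deviates from every learned norm."""
    region_mean = frame[mask].mean() if mask.any() else 0.0
    return not any(lo <= region_mean <= hi for lo, hi in rules)

def cascade(frames, rules):
    """Run the two-stage cascade over a frame sequence; returns indices
    of frames flagged as anomalous."""
    anomalies = []
    for i in range(1, len(frames)):
        mask = motion_mask(frames[i - 1], frames[i])
        if not lightweight_filter(mask):
            continue  # most frames exit here cheaply -- the source of the speedup
        if vlm_reasoning(frames[i], mask, rules):
            anomalies.append(i)
    return anomalies

# Illustration: three static frames, then a frame whose masked region
# falls outside the assumed normal range.
frames = [np.zeros((32, 32), dtype=np.uint8)] * 3 + [np.full((32, 32), 200, dtype=np.uint8)]
print(cascade(frames, rules=[(0, 100)]))  # the last frame is flagged
```

The design point the abstract emphasizes is that the filter, not the reasoner, sees every frame; the VLM runs only on the small fraction of frames that pass the motion gate, which is what makes real-time throughput plausible.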
Related papers
- No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection [15.949619310702579]
Existing video anomaly detection methods underperform in open-world scenarios. Key contributing factors include limited dataset diversity and an inadequate understanding of context-dependent anomalous semantics. We propose LAVIDA, an end-to-end zero-shot video anomaly detection framework.
arXiv Detail & Related papers (2026-02-22T16:03:43Z) - ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models [23.37951284612929]
We construct a dataset of over 30K instances spanning dynamic perception, scientific reasoning, and embodied decision-making domains. In ViRectify, we challenge MLLMs to perform step-wise error identification and generate rationales with key video evidence grounding. In addition, we propose the trajectory evidence-driven correction framework, comprising step-wise error trajectory and reward modeling on visual evidence-grounded correction.
arXiv Detail & Related papers (2025-12-01T09:05:02Z) - MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly Detection [28.5803063507761]
Video Anomaly Detection (VAD) aims to locate unusual activities or behaviors within videos. Online VAD has seldom received attention due to real-time constraints and computational intensity. We introduce MoniTor, a novel memory-based online scoring queue scheme for training-free VAD.
arXiv Detail & Related papers (2025-10-24T13:28:29Z) - Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought [58.321044666612174]
Vad-R1 is an end-to-end MLLM-based framework for Video Anomaly Reasoning. We design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies. We also propose an improved reinforcement learning algorithm, AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs.
arXiv Detail & Related papers (2025-05-26T12:05:16Z) - Flashback: Memory-Driven Zero-shot, Real-time Video Anomaly Detection [11.197888893266535]
Flashback is a zero-shot, real-time video anomaly detection paradigm. Inspired by the human cognitive mechanism of instantly judging anomalies, Flashback operates in two stages: Recall and Respond. By eliminating all LLM calls at inference time, Flashback delivers real-time VAD even on a consumer-grade GPU.
arXiv Detail & Related papers (2025-05-21T07:32:29Z) - SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model [52.47816604709358]
Video anomaly detection (VAD) aims to identify unexpected events in videos and has wide applications in safety-critical domains. Vision-language models (VLMs) have demonstrated strong multimodal reasoning capabilities, offering new opportunities for anomaly detection. We propose SlowFastVAD, a hybrid framework that integrates a fast anomaly detector with a slow anomaly detector.
arXiv Detail & Related papers (2025-04-14T15:30:03Z) - AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis [52.261173507177396]
We introduce AssistPDA, the first online video surveillance assistant that unifies video anomaly prediction, detection, and analysis (VAPDA) within a single framework. AssistPDA enables real-time inference on streaming videos while supporting interactive user engagement. We also introduce a novel event-level anomaly prediction task, enabling proactive anomaly forecasting before anomalies fully unfold.
arXiv Detail & Related papers (2025-03-27T18:30:47Z) - Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z) - Beyond the Benchmark: Detecting Diverse Anomalies in Videos [0.6993026261767287]
Video Anomaly Detection (VAD) plays a crucial role in modern surveillance systems, aiming to identify various anomalies in real-world situations.
Current benchmark datasets predominantly emphasize simple, single-frame anomalies such as novel object detection.
We advocate for an expansion of VAD investigations to encompass intricate anomalies that extend beyond conventional benchmark boundaries.
arXiv Detail & Related papers (2023-10-03T09:22:06Z) - It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training [76.69480467101143]
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
arXiv Detail & Related papers (2022-10-11T08:05:18Z) - Anomaly detection in surveillance videos using transformer based attention model [3.2968779106235586]
This research suggests using a weakly supervised strategy to avoid annotating anomalous segments in training videos.
The proposed framework is validated on a real-world dataset, i.e., the ShanghaiTech Campus dataset.
arXiv Detail & Related papers (2022-06-03T12:19:39Z) - Robust Unsupervised Video Anomaly Detection by Multi-Path Frame Prediction [61.17654438176999]
We propose a novel and robust unsupervised video anomaly detection method by frame prediction with proper design.
Our proposed method obtains a frame-level AUROC score of 88.3% on the CUHK Avenue dataset.
arXiv Detail & Related papers (2020-11-05T11:34:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.