Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting
- URL: http://arxiv.org/abs/2510.02155v1
- Date: Thu, 02 Oct 2025 16:06:31 GMT
- Title: Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting
- Authors: Shu Zou, Xinyu Tian, Lukas Wesemann, Fabian Waschkowski, Zhaoyuan Yang, Jing Zhang
- Abstract summary: We propose ASK-Hint, a structured prompting framework for video anomaly detection. Our approach organizes prompts into semantically coherent groups and formulates fine-grained guiding questions. Experiments on UCF-Crime and XD-Violence show that ASK-Hint consistently improves AUC over prior baselines.
- Score: 17.850029260662648
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prompting has emerged as a practical way to adapt frozen vision-language models (VLMs) for video anomaly detection (VAD). Yet, existing prompts are often overly abstract, overlooking the fine-grained human-object interactions or action semantics that define complex anomalies in surveillance videos. We propose ASK-Hint, a structured prompting framework that leverages action-centric knowledge to elicit more accurate and interpretable reasoning from frozen VLMs. Our approach organizes prompts into semantically coherent groups (e.g., violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. Extensive experiments on UCF-Crime and XD-Violence show that ASK-Hint consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods. Beyond accuracy, our framework provides interpretable reasoning traces toward anomalies and demonstrates strong generalization across datasets and VLM backbones. These results highlight the critical role of prompt granularity and establish ASK-Hint as a new training-free and generalizable solution for explainable video anomaly detection.
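To illustrate what group-structured, fine-grained prompting can look like in practice, here is a minimal sketch that queries a frozen VLM with yes/no guiding questions organized by semantic group. This is not the authors' implementation: the `vlm.ask(frames, prompt)` interface, the question wording, and the fraction-of-"yes" scoring rule are all assumptions; only the group names (violence, property crimes, public safety) come from the abstract.
```python
# Minimal sketch of ASK-Hint-style fine-grained prompting (hypothetical wording).
# Assumption: `vlm` exposes a chat-style `ask(frames, prompt) -> str` method.

ASK_HINT_GROUPS = {
    "violence": [
        "Is any person striking, kicking, or grappling with another person?",
        "Is a weapon such as a knife, gun, or blunt object visible or in use?",
    ],
    "property crimes": [
        "Is someone forcing open a door, window, or vehicle?",
        "Is an object being taken or concealed by someone who does not own it?",
    ],
    "public safety": [
        "Is there fire, smoke, or an explosion in the scene?",
        "Is a vehicle moving dangerously close to pedestrians?",
    ],
}

def anomaly_score(vlm, frames):
    """Score a clip as the fraction of fine-grained guiding questions answered 'yes'."""
    answers = []
    for group, questions in ASK_HINT_GROUPS.items():
        for question in questions:
            reply = vlm.ask(frames, f"[{group}] {question} Answer yes or no.")
            answers.append(reply.strip().lower().startswith("yes"))
    return sum(answers) / len(answers)
```
The key design point the abstract argues for is visible even in this toy version: each question targets a concrete human-object interaction or action, rather than asking the model an abstract "is this anomalous?" question.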
Related papers
- Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection [52.5174167737992]
Video anomaly detection (VAD) aims to identify abnormal events in videos. We propose SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our method achieves state-of-the-art performance among tuning-free approaches requiring only 1% of training data.
arXiv Detail & Related papers (2026-02-27T13:48:50Z)
- PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding [85.22047087898311]
We introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings. PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses to reinforce desirable outcomes. Experiments on the "3H" alignment objectives demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time.
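To make the contrastive mechanism concrete, the sketch below shows one decoding step of the general contrastive-decoding recipe this summary describes: next-token logits obtained under the positive guiding prompt are offset by logits obtained under the negative one. The plain subtraction, the greedy argmax, and the mixing weight `alpha` are assumptions for illustration; the paper's exact formulation may differ.
```python
import numpy as np

def contrastive_next_token(logits_pos: np.ndarray, logits_neg: np.ndarray,
                           alpha: float = 1.0) -> int:
    """One decoding step of polarity-prompt contrastive decoding (sketch).

    logits_pos: next-token logits given the positive guiding prompt
    logits_neg: next-token logits given the negative guiding prompt
    alpha: contrast strength (assumed hyperparameter, not from the paper)
    """
    # Reinforce tokens the positive prompt favors and suppress tokens the
    # negative prompt favors, then pick greedily.
    contrasted = logits_pos - alpha * logits_neg
    return int(np.argmax(contrasted))
```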
arXiv Detail & Related papers (2026-02-24T08:56:52Z)
- Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method [96.63801368613177]
We present a new task that elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning. We also present a new dataset with 8,641 videos, totaling more than 50,000 samples, making it one of the largest datasets for video anomaly understanding. Building upon the proposed task and dataset, we develop an end-to-end MLLM-based VAR model termed Vad-R1-Plus, which supports adaptive hierarchical reasoning and risk-aware decision making.
arXiv Detail & Related papers (2026-01-15T08:09:04Z)
- VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models [29.213430569936943]
We propose VADER, an LLM-driven framework for Video Anomaly unDErstanding. VADER integrates object features with visual cues to enhance anomaly comprehension from video. Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks.
arXiv Detail & Related papers (2025-11-10T16:56:11Z)
- Learning Event Completeness for Weakly Supervised Video Anomaly Detection [5.140169437190526]
We present LEC-VAD, a novel framework for learning event completeness in weakly supervised video anomaly detection. LEC-VAD encodes both category-aware and category-agnostic semantics between vision and language. We develop a novel memory bank-based prototype learning mechanism to enrich the concise text descriptions associated with anomaly-event categories.
arXiv Detail & Related papers (2025-06-16T04:56:58Z)
- Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection [2.749898166276854]
Weakly supervised video anomaly detection (WS-VAD) has emerged as a contemporary research direction. We propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity. We show that the proposed model achieves state-of-the-art results on benchmark datasets for violence and nudity detection.
arXiv Detail & Related papers (2024-12-29T12:46:57Z)
- Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight [2.290956583394892]
Video anomaly detection (VAD) has witnessed significant advancements through the integration of large language models (LLMs) and vision-language models (VLMs). This paper presents an in-depth review of cutting-edge LLM-/VLM-based methods in 2024.
arXiv Detail & Related papers (2024-12-24T09:05:37Z)
- MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection [107.15164718585666]
We investigate the root cause of VLMs' biased predictions in the open-vocabulary detection context.
Our observations lead to a simple yet effective paradigm, named MarvelOVD, that generates significantly better training targets.
Our method outperforms other state-of-the-art approaches by significant margins.
arXiv Detail & Related papers (2024-07-31T09:23:57Z)
- MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation [5.0923114224599555]
Video Anomaly Detection and Video Anomaly Recognition are critically important for applications in intelligent surveillance, evidence investigation, violence alerting, etc. These tasks face significant challenges due to the rarity of anomalies, which leads to extremely imbalanced data, and the impracticality of extensive frame-level annotation for supervised learning. This paper introduces MissionGNN, a novel hierarchical graph neural network (GNN) based model that addresses these challenges by leveraging a state-of-the-art large language model and a comprehensive knowledge graph for efficient weakly supervised learning in VAR.
arXiv Detail & Related papers (2024-06-27T01:09:07Z)
- Hawk: Learning to Understand Open-World Video Anomalies [76.9631436818573]
Video Anomaly Detection (VAD) systems can autonomously monitor and identify disturbances, reducing the need for manual labor and associated costs.
We introduce Hawk, a novel framework that leverages interactive large visual language models (VLMs) to interpret video anomalies precisely.
We have annotated over 8,000 anomaly videos with language descriptions, enabling effective training across diverse open-world scenarios, and also created 8,000 question-answering pairs for users' open-world questions.
arXiv Detail & Related papers (2024-05-27T07:08:58Z)
- Inverse-RLignment: Large Language Model Alignment from Demonstrations through Inverse Reinforcement Learning [62.05713042908654]
We introduce Alignment from Demonstrations (AfD), a novel approach leveraging high-quality demonstration data to overcome these challenges. We formalize AfD within a sequential decision-making framework, highlighting its unique challenge of missing reward signals. Practically, we propose a computationally efficient algorithm that extrapolates over a tailored reward model for AfD.
arXiv Detail & Related papers (2024-05-24T15:13:53Z)
- Harnessing Large Language Models for Training-free Video Anomaly Detection [34.76811491190446]
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video.
Training-based methods tend to be domain-specific, making them costly for practical deployment.
We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
arXiv Detail & Related papers (2024-04-01T09:34:55Z)
- Open-Vocabulary Video Anomaly Detection [57.552523669351636]
Video anomaly detection (VAD) with weak supervision has achieved remarkable performance by using video-level labels to discriminate whether a video frame is normal or abnormal.
Recent studies attempt to tackle a more realistic setting, open-set VAD, which aims to detect unseen anomalies given seen anomalies and normal videos.
This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD), which aims to leverage pre-trained large models to detect and categorize both seen and unseen anomalies; a generic zero-shot scoring sketch follows this entry.
arXiv Detail & Related papers (2023-11-13T02:54:17Z)
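As a generic illustration of the open-vocabulary idea described in the entry above (not that paper's method), the sketch below scores a frame by similarity between its embedding and text embeddings of free-form anomaly category names, so unseen categories can be covered simply by naming them. The `encoder` object and its `embed_image`/`embed_text` methods are hypothetical stand-ins for any CLIP-style pre-trained vision-language encoder.
```python
import numpy as np

def ovvad_scores(encoder, frame: np.ndarray, categories: list[str]) -> dict[str, float]:
    """Zero-shot anomaly scoring with a pre-trained vision-language encoder.

    Assumption: `encoder` exposes CLIP-style `embed_image` / `embed_text`
    methods returning L2-normalized vectors in a shared embedding space,
    so the dot product below is a cosine similarity.
    """
    img = encoder.embed_image(frame)
    return {
        category: float(encoder.embed_text(f"a video frame showing {category}") @ img)
        for category in categories
    }

# Unseen anomaly types are handled by extending the label list, e.g.:
# ovvad_scores(encoder, frame, ["fighting", "shoplifting", "road accident", "arson"])
```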