VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
- URL: http://arxiv.org/abs/2511.07299v1
- Date: Mon, 10 Nov 2025 16:56:11 GMT
- Title: VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
- Authors: Ying Cheng, Yu-Ho Lin, Min-Hung Chen, Fu-En Yang, Shang-Hong Lai
- Abstract summary: We propose VADER, an LLM-driven framework for Video Anomaly unDErstanding. VADER integrates object relation features with visual cues to enhance anomaly comprehension from video. Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks.
- Score: 29.213430569936943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video anomaly understanding (VAU) aims to provide detailed interpretation and semantic comprehension of anomalous events within videos, addressing limitations of traditional methods that focus solely on detecting and localizing anomalies. However, existing approaches often neglect the deeper causal relationships and interactions between objects, which are critical for understanding anomalous behaviors. In this paper, we propose VADER, an LLM-driven framework for Video Anomaly unDErstanding, which integrates keyframe object Relation features with visual cues to enhance anomaly comprehension from video. Specifically, VADER first applies an Anomaly Scorer to assign per-frame anomaly scores, followed by a Context-AwarE Sampling (CAES) strategy to capture the causal context of each anomalous event. A Relation Feature Extractor and a COntrastive Relation Encoder (CORE) jointly model dynamic object interactions, producing compact relational representations for downstream reasoning. These visual and relational cues are integrated with LLMs to generate detailed, causally grounded descriptions and support robust anomaly-related question answering. Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks, advancing the frontier of explainable video anomaly analysis.
Related papers
- Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning [23.043341269626016]
We propose a novel framework named LAS-VAD, short for Learning Anomaly Semantics for WS-VAD. Our framework integrates an anomaly-connected component mechanism and an intention awareness mechanism. It outperforms current state-of-the-art methods with remarkable gains.
arXiv Detail & Related papers (2026-02-28T08:57:33Z)
- Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection [52.5174167737992]
Video anomaly detection (VAD) aims to identify abnormal events in videos. We propose SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our method achieves state-of-the-art performance among tuning-free approaches, requiring only 1% of the training data.
arXiv Detail & Related papers (2026-02-27T13:48:50Z) - Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method [96.63801368613177]
We present a new task that elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning. We also introduce a new dataset with 8,641 videos, totaling more than 50,000 samples, making it one of the largest datasets for video anomaly understanding. Building upon the proposed task and dataset, we develop an end-to-end MLLM-based VAR model termed Vad-R1-Plus, which supports adaptive hierarchical reasoning and risk-aware decision making.
arXiv Detail & Related papers (2026-01-15T08:09:04Z) - A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis [64.42659342276117]
Most video-anomaly research stops at frame-wise detection, offering little insight into why an event is abnormal. Recent video anomaly localization and video anomaly understanding methods improve explainability but remain data-dependent and task-specific. We propose a unified reasoning framework that bridges the gap between temporal detection, spatial localization, and textual explanation.
arXiv Detail & Related papers (2025-11-02T14:49:08Z)
- Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection [33.77002721234086]
We propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Our method focuses on extracting and interpreting object activity and interactions over time. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability.
arXiv Detail & Related papers (2025-10-16T17:13:33Z)
- VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning [12.293826084601115]
Video anomaly understanding is essential for smart cities, security surveillance, and disaster alert systems. Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. We introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT).
arXiv Detail & Related papers (2025-05-29T14:48:10Z)
- Exploring What, Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly [12.896651217314744]
We introduce a benchmark for Exploring the Causation of Video Anomalies (ECVA). Our benchmark is meticulously designed, with each video accompanied by detailed human annotations. We propose AnomEval, a specialized evaluation metric crafted to align closely with human judgment criteria for ECVA.
arXiv Detail & Related papers (2024-12-10T04:41:44Z)
- Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity [35.14762107193339]
HIVAU-70k is a benchmark for hierarchical video anomaly understanding across any granularity. We develop a semi-automated annotation engine that efficiently scales high-quality annotations. For efficient anomaly detection in long videos, we propose the Anomaly-focused Temporal Sampler.
arXiv Detail & Related papers (2024-12-09T03:05:34Z)
- VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos. Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models. We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmarking task and find that most of the models encounter difficulties in effectively identifying the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z)
- Open-Vocabulary Video Anomaly Detection [57.552523669351636]
Video anomaly detection (VAD) with weak supervision has achieved remarkable performance in utilizing video-level labels to discriminate whether a video frame is normal or abnormal.
Recent studies attempt to tackle a more realistic setting, open-set VAD, which aims to detect unseen anomalies given seen anomalies and normal videos.
This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD), in which we aim to leverage pre-trained large models to detect and categorize seen and unseen anomalies.
arXiv Detail & Related papers (2023-11-13T02:54:17Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has been paid increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos via cross-modal queries.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- Spatio-Temporal Relation Learning for Video Anomaly Detection [35.59510027883497]
Anomaly identification is highly dependent on the relationship between the object and the scene.
In this paper, we propose a Spatial-Temporal Relation Learning framework to tackle the video anomaly detection task.
Experiments are conducted on three public datasets, and the superior performance over the state-of-the-art methods demonstrates the effectiveness of our method.
arXiv Detail & Related papers (2022-09-27T02:19:31Z)