A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis
- URL: http://arxiv.org/abs/2511.00962v1
- Date: Sun, 02 Nov 2025 14:49:08 GMT
- Title: A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis
- Authors: Dongheng Lin, Mengxue Qu, Kunyang Han, Jianbo Jiao, Xiaojie Jin, Yunchao Wei,
- Abstract summary: Most video-anomaly research stops at frame-wise detection, offering little insight into why an event is abnormal. Recent video anomaly localization and video anomaly understanding methods improve explainability but remain data-dependent and task-specific. We propose a unified reasoning framework that bridges the gap between temporal detection, spatial localization, and textual explanation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most video-anomaly research stops at frame-wise detection, offering little insight into why an event is abnormal, typically outputting only frame-wise anomaly scores without spatial or semantic context. Recent video anomaly localization and video anomaly understanding methods improve explainability but remain data-dependent and task-specific. We propose a unified reasoning framework that bridges the gap between temporal detection, spatial localization, and textual explanation. Our approach is built upon a chained test-time reasoning process that sequentially connects these tasks, enabling holistic zero-shot anomaly analysis without any additional training. Specifically, our approach leverages intra-task reasoning to refine temporal detections and inter-task chaining for spatial and semantic understanding, yielding improved interpretability and generalization in a fully zero-shot manner. Without any additional data or gradients, our method achieves state-of-the-art zero-shot performance across multiple video anomaly detection, localization, and explanation benchmarks. The results demonstrate that careful prompt design with task-wise chaining can unlock the reasoning power of foundation models, enabling practical, interpretable video anomaly analysis in a fully zero-shot manner. Project Page: https://rathgrith.github.io/Unified_Frame_VAA/.
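The chained test-time process described in the abstract can be pictured as a pipeline: score frames, refine the temporal detections (intra-task reasoning), then chain into spatial localization and textual explanation (inter-task chaining). The sketch below is purely illustrative; the stage functions, thresholds, and the placeholder box are assumptions standing in for prompts to a vision-language foundation model, not the paper's actual implementation.

```python
# Hypothetical sketch of the chained zero-shot pipeline. The stage
# functions are stand-ins for prompting a vision-language foundation
# model; all names and thresholds here are illustrative assumptions.

def detect_temporal(frame_scores, threshold=0.5):
    """Flag frames whose anomaly score passes a threshold."""
    return [i for i, s in enumerate(frame_scores) if s >= threshold]

def refine_temporal(candidates, min_run=2):
    """Intra-task reasoning (stubbed): drop isolated detections
    shorter than min_run consecutive frames."""
    kept, run = [], []
    for i in candidates:
        if run and i == run[-1] + 1:
            run.append(i)
        else:
            if len(run) >= min_run:
                kept.extend(run)
            run = [i]
    if len(run) >= min_run:
        kept.extend(run)
    return kept

def localize_spatial(frame_idx):
    """Inter-task chaining (stubbed): ask *where* in a flagged frame
    the anomaly occurs; returns a placeholder bounding box."""
    return {"frame": frame_idx, "box": (0, 0, 64, 64)}

def explain(regions):
    """Final link in the chain: turn localized regions into text."""
    frames = sorted({r["frame"] for r in regions})
    return f"Anomalous activity localized in frames {frames}."

def analyze(frame_scores):
    flagged = refine_temporal(detect_temporal(frame_scores))
    regions = [localize_spatial(i) for i in flagged]
    return flagged, regions, explain(regions)

flagged, regions, text = analyze([0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6])
print(flagged)  # the isolated detection at frame 6 is pruned
```

Note how no stage is trained: each consumes the previous stage's output at test time, which is the sense in which the chaining is "fully zero-shot."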
Related papers
- ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning [44.49803237328707]
ReVSeg executes reasoning as sequential decisions in the native interface of pretrained vision language models.
We employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals.
arXiv Detail & Related papers (2025-12-02T14:44:12Z) - VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models [29.213430569936943]
We propose VADER, an LLM-driven framework for Video Anomaly unDErstanding.
VADER integrates object features with visual cues to enhance anomaly comprehension from video.
Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks.
arXiv Detail & Related papers (2025-11-10T16:56:11Z) - Action Hints: Semantic Typicality and Context Uniqueness for Generalizable Skeleton-based Video Anomaly Detection [39.65895515365808]
We propose a novel zero-shot video anomaly detection framework, unlocking the potential of skeleton data via action typicality and uniqueness learning.
Our method achieves state-of-the-art results against skeleton-based methods on four large-scale VAD datasets.
arXiv Detail & Related papers (2025-09-14T02:51:32Z) - VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding [22.43740206690383]
Video Anomaly Detection (VAD) aims to identify anomalous events in videos and accurately determine their time intervals.
VAGU is the first benchmark to integrate anomaly understanding and grounding.
We propose Glance then Scrutinize (GtS), a training-free framework guided by textual prompts.
We also propose the JeAUG metric, which jointly evaluates semantic interpretability and temporal precision.
arXiv Detail & Related papers (2025-07-29T05:17:48Z) - VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning [12.293826084601115]
Video anomaly understanding is essential for smart cities, security surveillance, and disaster alert systems.
Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events.
We introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT).
arXiv Detail & Related papers (2025-05-29T14:48:10Z) - Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z) - Uncovering the Missing Pattern: Unified Framework Towards Trajectory Imputation and Prediction [60.60223171143206]
Trajectory prediction is a crucial undertaking in understanding entity movement or human behavior from observed sequences.
Current methods often assume that the observed sequences are complete while ignoring the potential for missing values.
This paper presents a unified framework, the Graph-based Conditional Variational Recurrent Neural Network (GC-VRNN), which can perform trajectory imputation and prediction simultaneously.
arXiv Detail & Related papers (2023-03-28T14:27:27Z) - Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z) - Robust Unsupervised Video Anomaly Detection by Multi-Path Frame Prediction [61.17654438176999]
We propose a novel and robust unsupervised video anomaly detection method by frame prediction with proper design.
Our proposed method obtains the frame-level AUROC score of 88.3% on the CUHK Avenue dataset.
arXiv Detail & Related papers (2020-11-05T11:34:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.