Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly
- URL: http://arxiv.org/abs/2412.07183v1
- Date: Tue, 10 Dec 2024 04:41:44 GMT
- Title: Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly
- Authors: Hang Du, Guoshun Nan, Jiawen Qian, Wangchenhui Wu, Wendi Deng, Hanqing Mu, Zhenyan Chen, Pengxuan Mao, Xiaofeng Tao, Jun Liu
- Abstract summary: We introduce a benchmark for Exploring the Causation of Video Anomalies (ECVA). Our benchmark is meticulously designed, with each video accompanied by detailed human annotations. We propose AnomEval, a specialized evaluation metric crafted to align closely with human judgment criteria for ECVA.
- Score: 12.896651217314744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in video anomaly understanding (VAU) have opened the door to groundbreaking applications in various fields, such as traffic monitoring and industrial automation. While current benchmarks in VAU predominantly emphasize the detection and localization of anomalies, we delve deeper into the practical aspects of VAU by addressing three essential questions: "what anomaly occurred?", "why did it happen?", and "how severe is this abnormal event?". In pursuit of these answers, we introduce a comprehensive benchmark for Exploring the Causation of Video Anomalies (ECVA). Our benchmark is meticulously designed, with each video accompanied by detailed human annotations. Specifically, each instance in ECVA carries three sets of human annotations indicating the "what", "why" and "how" of an anomaly: 1) the anomaly type, start and end times, and an event description; 2) a natural-language explanation of the cause of the anomaly; and 3) free text reflecting the effect of the abnormality. Building upon this foundation, we propose a novel prompt-based methodology that serves as a baseline for tackling the intricate challenges posed by ECVA. We utilize a "hard prompt" to guide the model to focus on the critical parts of video anomaly segments, and a "soft prompt" to establish temporal and spatial relationships within these segments. Furthermore, we propose AnomEval, a specialized evaluation metric crafted to align closely with human judgment criteria for ECVA. This metric leverages the unique features of the ECVA dataset to provide a more comprehensive and reliable assessment of various video large language models. We demonstrate the efficacy of our approach through rigorous experimental analysis and delineate possible avenues for further investigation into the comprehension of video anomaly causation.
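To make the three annotation sets concrete, below is a minimal sketch of what one ECVA instance could look like. The field names are hypothetical illustrations drawn from the abstract, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ECVAAnnotation:
    """One ECVA instance: the "what", "why" and "how" of a video anomaly.

    Field names are hypothetical; the released dataset schema may differ.
    """
    # "What": anomaly type, temporal extent, and event description
    anomaly_type: str          # e.g. "traffic accident"
    start_time: float          # anomaly onset, in seconds
    end_time: float            # anomaly end, in seconds
    event_description: str     # free-text description of the event
    # "Why": natural-language explanation of the cause
    cause_explanation: str
    # "How": free text reflecting the effect/severity of the abnormality
    effect_description: str
```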
Related papers
- VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding [22.43740206690383]
Video Anomaly Detection (VAD) aims to identify anomalous events in videos and accurately determine their time intervals.
VAGU is the first benchmark to integrate anomaly understanding and grounding.
We propose Glance then Scrutinize (GtS), a training-free framework guided by textual prompts.
We also propose the JeAUG metric, which jointly evaluates semantic interpretability and temporal precision.
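The summary does not define JeAUG; as a rough illustration only, a metric that jointly rewards temporal precision and semantic quality could combine a temporal IoU with a semantic score. Both the names and the combination rule below are assumptions, not the paper's formula.

```python
def temporal_iou(pred, gt):
    """IoU between a predicted and a ground-truth (start, end) interval."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def joint_score(pred_interval, gt_interval, semantic_score):
    """Combine temporal precision with a semantic-quality score in [0, 1].

    Multiplying the two requires both to be high; the real JeAUG
    definition may differ.
    """
    return temporal_iou(pred_interval, gt_interval) * semantic_score

# Example: a well-grounded but semantically weak answer scores low overall.
print(joint_score((10.0, 20.0), (12.0, 22.0), semantic_score=0.4))
```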
arXiv Detail & Related papers (2025-07-29T05:17:48Z)
- Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline [63.96226274616927]
A new framework called Track Any Anomalous Object (TAO) introduces a granular video anomaly detection pipeline.
Unlike methods that assign anomaly scores to every pixel, our approach transforms the problem into pixel-level tracking of anomalous objects.
Experiments demonstrate that TAO sets new benchmarks in accuracy and robustness.
arXiv Detail & Related papers (2025-06-05T15:49:39Z)
- VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning [12.293826084601115]
Video anomaly understanding is essential for smart cities, security surveillance, and disaster alert systems.
Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events.
We introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT).
arXiv Detail & Related papers (2025-05-29T14:48:10Z)
- SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models [8.402075279942256]
SurveillanceVQA-589K is the largest open-ended video question answering benchmark tailored to the surveillance domain.
The dataset comprises 589,380 QA pairs spanning 12 cognitively diverse question types.
Our benchmark provides a practical and comprehensive resource for advancing video-language understanding in safety-critical applications.
arXiv Detail & Related papers (2025-05-19T00:57:04Z)
- VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an automated framework for modeling, evaluating, and measuring the causal understanding of video generation models (VGMs) in real-world scenarios.
We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z)
- VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos.
Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models.
We evaluate nine existing Video-LMMs, both open and closed source, on this benchmark and find that most models have difficulty effectively identifying the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z)
- Hawk: Learning to Understand Open-World Video Anomalies [76.9631436818573]
Video Anomaly Detection (VAD) systems can autonomously monitor and identify disturbances, reducing the need for manual labor and associated costs.
We introduce Hawk, a novel framework that leverages interactive large Visual Language Models (VLM) to interpret video anomalies precisely.
We have annotated over 8,000 anomaly videos with language descriptions, enabling effective training across diverse open-world scenarios, and also created 8,000 question-answering pairs for users' open-world questions.
arXiv Detail & Related papers (2024-05-27T07:08:58Z)
- Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly [29.822544507594056]
We present a benchmark for Causation Understanding of Video Anomaly (CUVA).
Each instance of the proposed benchmark involves three sets of human annotations to indicate the "what", "why" and "how" of an anomaly.
MMEval is a novel evaluation metric designed to better align with human preferences for CUVA.
arXiv Detail & Related papers (2024-04-30T20:11:49Z)
- Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection [16.77262005540559]
A novel framework is proposed to guide the learning of suspected anomalies from event prompts.
It enables a new multi-prompt learning process to constrain the visual-semantic features across all videos.
Our proposed model outperforms most state-of-the-art methods in terms of AP or AUC.
arXiv Detail & Related papers (2024-03-02T10:42:47Z)
- Grounded Question-Answering in Long Egocentric Videos [39.281013854331285]
Open-ended question-answering (QA) in long, egocentric videos allows individuals or robots to inquire about their own past visual experiences.
This task presents unique challenges, including the complexity of temporally grounding queries within extensive video content.
Our proposed approach tackles these challenges by (i) integrating query grounding and answering within a unified model to reduce error propagation.
arXiv Detail & Related papers (2023-12-11T16:31:55Z)
- Dynamic Erasing Network Based on Multi-Scale Temporal Features for Weakly Supervised Video Anomaly Detection [103.92970668001277]
We propose a Dynamic Erasing Network (DE-Net) for weakly supervised video anomaly detection.
We first propose a multi-scale temporal modeling module, capable of extracting features from segments of varying lengths.
Then, we design a dynamic erasing strategy, which dynamically assesses the completeness of the detected anomalies.
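As a rough illustration of the multi-scale temporal modeling idea above, the sketch below average-pools a precomputed frame-feature sequence over segments of varying lengths. The inputs and scales are hypothetical; this is not the DE-Net module itself.

```python
import numpy as np

def multi_scale_temporal_features(frame_feats, scales=(1, 2, 4)):
    """Average-pool a (T, D) frame-feature sequence over segments of
    varying lengths, producing one pooled sequence per scale.

    Assumes T is divisible by each scale for simplicity.
    """
    T, _ = frame_feats.shape
    pooled_per_scale = {}
    for s in scales:
        n_seg = T // s
        pooled = [frame_feats[i * s:(i + 1) * s].mean(axis=0)
                  for i in range(n_seg)]
        pooled_per_scale[s] = np.stack(pooled)  # shape: (T // s, D)
    return pooled_per_scale

# Example: 16 frames of 512-d features pooled at three temporal scales.
feats = np.random.randn(16, 512)
multi = multi_scale_temporal_features(feats)
```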
arXiv Detail & Related papers (2023-12-04T09:40:11Z)
- Open-Vocabulary Video Anomaly Detection [57.552523669351636]
Video anomaly detection (VAD) with weak supervision has achieved remarkable performance in utilizing video-level labels to discriminate whether a video frame is normal or abnormal.
Recent studies attempt to tackle a more realistic setting, open-set VAD, which aims to detect unseen anomalies given seen anomalies and normal videos.
This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD), in which we aim to leverage pre-trained large models to detect and categorize seen and unseen anomalies.
arXiv Detail & Related papers (2023-11-13T02:54:17Z)
- Delving into CLIP latent space for Video Anomaly Recognition [24.37974279994544]
We introduce the novel method AnomalyCLIP, the first to apply Large Language and Vision (LLV) models, such as CLIP, to video anomaly recognition.
Our approach specifically involves manipulating the latent CLIP feature space to identify the normal event subspace.
When anomalous frames are projected onto these directions, they exhibit a large feature magnitude if they belong to a particular class.
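As an illustration of this idea, the sketch below measures how much of a CLIP frame feature falls outside a normal-event subspace, which is assumed to have been estimated beforehand (e.g. via PCA over normal frames). It is a simplification, not the AnomalyCLIP implementation.

```python
import numpy as np

def anomaly_magnitude(frame_feat, normal_basis):
    """Feature magnitude outside the normal-event subspace.

    frame_feat:   (D,) CLIP embedding of one frame.
    normal_basis: (D, k) orthonormal columns spanning the normal-event
                  subspace (e.g. top-k PCA directions of normal frames).
    """
    # Part of the feature explained by normal events
    projection = normal_basis @ (normal_basis.T @ frame_feat)
    # Residual that normal events cannot explain; large for anomalies
    return np.linalg.norm(frame_feat - projection)
```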
arXiv Detail & Related papers (2023-10-04T14:01:55Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos using cross-modal queries.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
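As a minimal sketch of what cross-modal retrieval involves, the function below ranks candidate videos against a query from another modality, assuming both have already been embedded into a shared space. The function and its inputs are hypothetical, not the paper's model.

```python
import numpy as np

def retrieve_anomalous_videos(query_emb, video_embs, top_k=5):
    """Rank videos by cosine similarity to a cross-modal query embedding.

    query_emb:  (D,) embedding of a text (or other modality) query.
    video_embs: (N, D) video embeddings in the same shared space.
    Returns the indices of the top_k most similar videos.
    """
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ q                      # (N,) cosine similarities
    return np.argsort(-sims)[:top_k]
```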
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.