VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning
- URL: http://arxiv.org/abs/2505.23504v1
- Date: Thu, 29 May 2025 14:48:10 GMT
- Title: VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning
- Authors: Liyun Zhu, Qixiang Chen, Xi Shen, Xiaodong Cun
- Abstract summary: Video anomaly understanding is essential for smart cities, security surveillance, and disaster alert systems. Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. We introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT).
- Score: 12.293826084601115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Anomaly Understanding (VAU) is essential for applications such as smart cities, security surveillance, and disaster alert systems, yet remains challenging due to its demand for fine-grained spatio-temporal perception and robust reasoning under ambiguity. Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. This limitation is further compounded by the absence of comprehensive benchmarks for evaluating reasoning ability in anomaly scenarios. To address both challenges, we introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT). In addition, we propose VAU-Bench, the first Chain-of-Thought benchmark tailored for video anomaly reasoning, featuring multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. Empirical results show that VAU-R1 significantly improves question answering accuracy, temporal grounding, and reasoning coherence across diverse contexts. Together, our method and benchmark establish a strong foundation for interpretable and reasoning-aware video anomaly understanding. Our code is available at https://github.com/GVCLab/VAU-R1.
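Reinforcement Fine-Tuning in this setting optimizes the MLLM against verifiable rewards (e.g., answer correctness and temporal overlap) rather than imitating supervised rationales. The sketch below is an illustration only, not the exact reward design used by VAU-R1; the weights, interval format, and function names are assumptions.

```python
# Illustrative sketch of a verifiable reward for reinforcement fine-tuning (RFT)
# on video anomaly QA. Weights, parsing rules, and function names are assumptions,
# not the exact design used by VAU-R1.

def temporal_iou(pred_span, gt_span):
    """Intersection-over-union between two (start, end) intervals in seconds."""
    start = max(pred_span[0], gt_span[0])
    end = min(pred_span[1], gt_span[1])
    inter = max(0.0, end - start)
    union = (pred_span[1] - pred_span[0]) + (gt_span[1] - gt_span[0]) - inter
    return inter / union if union > 0 else 0.0

def rft_reward(pred_choice, gt_choice, pred_span, gt_span, w_acc=1.0, w_tiou=1.0):
    """Composite reward: exact-match accuracy on the multiple-choice answer
    plus temporal IoU of the predicted anomaly interval."""
    acc = 1.0 if pred_choice == gt_choice else 0.0
    return w_acc * acc + w_tiou * temporal_iou(pred_span, gt_span)

# Example: correct option, partially overlapping interval.
print(rft_reward("B", "B", (4.0, 10.0), (5.0, 12.0)))  # 1.0 + 0.625 = 1.625
```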
Related papers
- VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding [22.43740206690383]
Video Anomaly Detection (VAD) aims to identify anomalous events in videos and accurately determine their time intervals. VAGU is the first benchmark to integrate anomaly understanding and grounding. We propose Glance then Scrutinize (GtS), a training-free framework guided by textual prompts. We also propose the JeAUG metric, which jointly evaluates semantic interpretability and temporal precision.
arXiv Detail & Related papers (2025-07-29T05:17:48Z) - Reinforcing Video Reasoning with Focused Thinking [65.85683941058916]
We propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density. We also reformulate RL training by shifting from single-choice to multi-choice QA tasks.
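The token weighting idea can be pictured as scaling each token's contribution to the policy-gradient loss by how informative it is. How TW-GRPO actually measures informational density is defined in the paper; the sketch below uses per-step policy entropy purely as a stand-in proxy.

```python
import torch

# Generic sketch of a token-weighted policy-gradient loss. The entropy-based
# weighting is an assumed proxy, not the TW-GRPO formulation itself.

def token_weighted_pg_loss(logits, actions, advantages):
    """logits: (T, V) per-token vocabulary logits; actions: (T,) sampled token ids;
    advantages: (T,) per-token advantage estimates."""
    log_probs = torch.log_softmax(logits, dim=-1)                  # (T, V)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # (T,) log-prob of each sampled token
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)           # (T,) informativeness proxy
    weights = entropy / (entropy.sum() + 1e-8)                     # normalize weights to sum to 1
    return -(weights * advantages * chosen).sum()

# Example with random tensors: 6 generated tokens over a 32-token vocabulary.
loss = token_weighted_pg_loss(torch.randn(6, 32), torch.randint(0, 32, (6,)), torch.randn(6))
```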
arXiv Detail & Related papers (2025-05-30T15:42:19Z) - Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought [58.321044666612174]
Vad-R1 is an end-to-end MLLM-based framework for Video Anomaly Reasoning. We design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies. We also propose an improved reinforcement learning algorithm, AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs.
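AVA-GRPO builds on the GRPO family, whose core is a group-relative advantage: several responses are sampled for the same prompt and each reward is normalized against its group. The anomaly-specific incentive terms that AVA-GRPO adds on top are not reproduced here; the sketch shows only the generic group normalization.

```python
from statistics import mean, stdev

# Minimal sketch of the group-relative advantage at the core of GRPO-style training.

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: rewards for four candidate answers sampled for the same video question.
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))
```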
arXiv Detail & Related papers (2025-05-26T12:05:16Z) - SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models [8.402075279942256]
SurveillanceVQA-589K is the largest open-ended video question answering benchmark tailored to the surveillance domain. The dataset comprises 589,380 QA pairs spanning 12 cognitively diverse question types. Our benchmark provides a practical and comprehensive resource for advancing video-language understanding in safety-critical applications.
arXiv Detail & Related papers (2025-05-19T00:57:04Z) - VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models [29.706347050700867]
We introduce a novel benchmark named Video-based long-form Causal Reasoning (VCRBench). VCRBench tests whether Large Video Language Models (LVLMs) can identify, reason about, and correctly sequence the events needed to accomplish a specific goal. We propose Recognition-Reasoning Decomposition (RRD), a modular approach that breaks video-based causal reasoning into the two sub-tasks of video recognition and causal reasoning.
arXiv Detail & Related papers (2025-05-13T11:35:58Z) - Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z) - Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly [12.896651217314744]
We introduce a benchmark for Exploring the Causation of Video Anomalies (ECVA). Our benchmark is meticulously designed, with each video accompanied by detailed human annotations. We propose AnomEval, a specialized evaluation metric crafted to align closely with human judgment criteria for ECVA.
arXiv Detail & Related papers (2024-12-10T04:41:44Z) - STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks. However, Video-LLMs struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z) - Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly [29.822544507594056]
We present a benchmark for Causation Understanding of Video Anomaly (CUVA)
Each instance of the proposed benchmark involves three sets of human annotations to indicate the "what", "why" and "how" of an anomaly.
MMEval is a novel evaluation metric designed to better align with human preferences for CUVA.
arXiv Detail & Related papers (2024-04-30T20:11:49Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos via cross-modal queries.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
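Cross-modal retrieval of this kind typically reduces to ranking video embeddings against a query embedding from another modality. The sketch below assumes precomputed embeddings from unspecified encoders (not shown) and simply ranks candidates by cosine similarity.

```python
import numpy as np

# Sketch of cross-modal retrieval as framed by VAR: rank candidate videos by the
# cosine similarity between a query embedding and precomputed video embeddings.

def retrieve(query_emb, video_embs, top_k=5):
    """query_emb: (d,); video_embs: (N, d); returns indices of the top-k videos."""
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    v = video_embs / (np.linalg.norm(video_embs, axis=1, keepdims=True) + 1e-8)
    scores = v @ q                      # cosine similarities, shape (N,)
    return np.argsort(-scores)[:top_k]

# Example with random embeddings: 100 videos, 256-dim features.
print(retrieve(np.random.rand(256), np.random.rand(100, 256), top_k=3))
```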
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - Robust Unsupervised Video Anomaly Detection by Multi-Path Frame Prediction [61.17654438176999]
We propose a novel and robust unsupervised video anomaly detection method built on multi-path frame prediction.
Our proposed method obtains a frame-level AUROC of 88.3% on the CUHK Avenue dataset.
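Frame-prediction methods of this kind typically turn reconstruction quality into an anomaly score: frames the predictor models poorly are flagged as anomalous. The following is a minimal sketch of the common PSNR-based scoring; the multi-path predictor itself is not reproduced here.

```python
import numpy as np

# Sketch of the standard frame-prediction scoring scheme used by this line of work.

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio between a predicted frame and the real frame."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / (mse + 1e-12))

def anomaly_scores(psnr_values):
    """Min-max normalize PSNR over a video; lower PSNR yields a higher anomaly score."""
    p = np.asarray(psnr_values, dtype=np.float64)
    norm = (p - p.min()) / (p.max() - p.min() + 1e-8)
    return 1.0 - norm
```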
arXiv Detail & Related papers (2020-11-05T11:34:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all listed content) and is not responsible for any consequences arising from its use.