VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding
- URL: http://arxiv.org/abs/2507.21507v1
- Date: Tue, 29 Jul 2025 05:17:48 GMT
- Title: VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding
- Authors: Shibo Gao, Peipei Yang, Yangyang Liu, Yi Chen, Han Zhu, Xuyao Zhang, Linlin Huang
- Abstract summary: Video Anomaly Detection (VAD) aims to identify anomalous events in videos and accurately determine their time intervals. VAGU is the first benchmark to integrate anomaly understanding and grounding. We propose Glance then Scrutinize (GtS), a training-free framework guided by textual prompts. We also propose the JeAUG metric, which jointly evaluates semantic interpretability and temporal precision.
- Score: 22.43740206690383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Anomaly Detection (VAD) aims to identify anomalous events in videos and accurately determine their time intervals. Current VAD methods mainly fall into two categories: traditional DNN-based approaches that focus on temporal localization, and LLM-based approaches that emphasize semantic understanding. Both anomaly understanding and grounding are essential for comprehensive video anomaly detection and can complement each other. However, no existing model or dataset supports both tasks simultaneously. To address this, we introduce VAGU (Video Anomaly Grounding and Understanding), the first benchmark to integrate both tasks. Each VAGU instance includes annotations for anomaly category, semantic explanation, precise temporal grounding and Video QA. We also provide multiple-choice Video QA for objective evaluation. Based on this dataset, we propose Glance then Scrutinize (GtS), a training-free framework guided by textual prompts. The framework first enables coarse localization of high-probability anomalous regions, followed by detailed anomaly interpretation and temporal boundary refinement. Additionally, we propose the JeAUG metric, which jointly evaluates semantic interpretability and temporal precision, overcoming the limitations of traditional metrics. Extensive experiments verify the effectiveness of our benchmark, framework, and evaluation metric.
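As a hedged illustration of how a joint metric in the spirit of JeAUG could couple the two axes described above, consider the minimal Python sketch below. The product-style combination and all names in it are assumptions for illustration only; the paper defines the actual JeAUG formula.

```python
# Minimal sketch of a JeAUG-style joint score. The paper defines the real
# JeAUG formula; the product combination below is an illustrative
# assumption, not the official metric.

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def jeaug_like(semantic_score: float, pred, gt) -> float:
    """Joint score: a semantic score in [0, 1] (e.g., multiple-choice
    Video QA accuracy) gated by the temporal precision of the grounding."""
    return semantic_score * temporal_iou(pred, gt)

# A correct explanation with loose localization scores lower than the
# same explanation with a tight interval.
print(jeaug_like(1.0, (10.0, 20.0), (12.0, 22.0)))  # 1.0 * 8/12 ≈ 0.667
print(jeaug_like(1.0, (12.0, 22.0), (12.0, 22.0)))  # 1.0 * 1.0 = 1.0
```

The point of a joint score like this is that a model cannot do well by explaining the anomaly while grounding it poorly, or vice versa, which is precisely what separate understanding and localization metrics fail to capture.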
Related papers
- VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z) - ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations [33.74746234704817]
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging, as it involves deep vision-language understanding, pixel-level dense prediction, and spatio-temporal reasoning. We propose ReferDINO, an RVOS model that inherits region-level vision-text alignment from foundational visual grounding models.
arXiv Detail & Related papers (2025-01-24T16:24:15Z) - Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly [12.896651217314744]
We introduce a benchmark for Exploring the Causation of Video Anomalies (ECVA). Our benchmark is meticulously designed, with each video accompanied by detailed human annotations. We propose AnomEval, a specialized evaluation metric crafted to align closely with human judgment criteria for ECVA.
arXiv Detail & Related papers (2024-12-10T04:41:44Z) - Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity [35.14762107193339]
HIVAU-70k is a benchmark for hierarchical video anomaly understanding across any granularity. We develop a semi-automated annotation engine that efficiently scales high-quality annotations. For efficient anomaly detection in long videos, we propose the Anomaly-focused Temporal Sampler.
arXiv Detail & Related papers (2024-12-09T03:05:34Z) - GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess their performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z) - On the Consistency of Video Large Language Models in Temporal Comprehension [57.985769348320616]
Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. We conduct a study on prediction consistency, a key indicator of the robustness and trustworthiness of temporal grounding.
arXiv Detail & Related papers (2024-11-20T00:47:17Z) - Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z) - Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM [35.06386971859359]
Holmes-VAD is a novel framework that leverages precise temporal supervision and rich multimodal instructions.
We construct the first large-scale multimodal VAD instruction-tuning benchmark, VAD-Instruct50k.
Building upon the VAD-Instruct50k dataset, we develop a customized solution for interpretable video anomaly detection.
arXiv Detail & Related papers (2024-06-18T03:19:24Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos by cross-modalities.
We present two benchmarks, UCFCrime-AR and XDViolence-AR, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric, "dR@n,IoU@m", that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets (sketched below this list).
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
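As a rough sketch of the discounting idea behind a metric like "dR@n,IoU@m" (the exact discount in that paper may differ; the boundary-offset scaling and all names below are illustrative assumptions):

```python
# Hedged sketch of discounted recall for temporal grounding. A plain
# R@n,IoU@m counts a hit (1.0) if any of the top-n predictions reaches
# IoU >= m with the ground truth; the discounted variant scales that hit
# by how close the predicted boundaries are to the true ones, so trivially
# biased predictions (e.g., always returning most of the video) score lower.

def discounted_recall_single(preds, gt, m, video_len):
    """preds: top-n (start, end) predictions; gt: (start, end); m: IoU
    threshold; video_len: duration used to normalize boundary offsets."""
    for s, e in preds:
        inter = max(0.0, min(e, gt[1]) - max(s, gt[0]))
        union = max(e, gt[1]) - min(s, gt[0])
        if union > 0 and inter / union >= m:
            alpha_s = 1.0 - abs(s - gt[0]) / video_len  # start-boundary discount
            alpha_e = 1.0 - abs(e - gt[1]) / video_len  # end-boundary discount
            return alpha_s * alpha_e  # plain recall would return 1.0 here
    return 0.0

# Example: both predictions clear IoU@0.5, but the looser one is discounted.
print(discounted_recall_single([(10.0, 20.0)], (12.0, 22.0), 0.5, 60.0))  # ≈ 0.934
print(discounted_recall_single([(12.0, 22.0)], (12.0, 22.0), 0.5, 60.0))  # 1.0
```

Averaging this per-sample value over a test set yields the dataset-level score; a model exploiting annotation biases can still clear a low IoU threshold, but its hits are discounted in proportion to the boundary error.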
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.