FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding
- URL: http://arxiv.org/abs/2601.17258v1
- Date: Sat, 24 Jan 2026 02:17:07 GMT
- Title: FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding
- Authors: João Pereira, Vasco Lopes, João Neves, David Semedo
- Abstract summary: Video Anomaly Understanding (VAU) is a novel task focused on describing unusual occurrences in videos. Existing benchmarks rely on n-gram-based metrics (e.g., BLEU, ROUGE-L) or LLM-based evaluation. We propose FineVAU, a new benchmark for VAU that shifts the focus towards rich, fine-grained and domain-specific understanding of anomalous videos.
- Score: 3.451422886843121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Anomaly Understanding (VAU) is a novel task focused on describing unusual occurrences in videos. Despite growing interest, the evaluation of VAU remains an open challenge. Existing benchmarks rely on n-gram-based metrics (e.g., BLEU, ROUGE-L) or LLM-based evaluation. The former fails to capture the rich, free-form, and visually grounded nature of LVLM responses, while the latter emphasizes language quality over factual relevance, often producing subjective judgments that are misaligned with human perception. In this work, we address this issue by proposing FineVAU, a new benchmark for VAU that shifts the focus towards rich, fine-grained, and domain-specific understanding of anomalous videos. We formulate VAU as a three-fold problem, with the goal of comprehensively understanding key descriptive elements of anomalies in video: events (What), participating entities (Who), and location (Where). Our benchmark introduces a) FVScore, a novel, human-aligned evaluation metric that assesses the presence of critical visual elements in LVLM answers, providing interpretable, fine-grained feedback; and b) FineW3, a novel, comprehensive dataset curated through a structured and fully automatic procedure that augments existing human annotations with high-quality, fine-grained visual information. Human evaluation reveals that our proposed metric aligns with human perception of anomalies better than current approaches. Detailed experiments on FineVAU unveil critical limitations in LVLMs' ability to perceive anomalous events that require spatial and fine-grained temporal understanding, despite strong performance on coarse-grained, static information and events with strong visual cues.
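To make the failure mode of n-gram metrics concrete, here is a minimal sketch (assuming NLTK is installed; the reference and candidate sentences are hypothetical) in which BLEU rewards lexical overlap over factual correctness: a factually wrong answer that reuses the reference's wording outscores a correct paraphrase.

```python
# Minimal sketch: BLEU rewards surface overlap, not factual correctness.
# Assumes NLTK is installed; sentences are hypothetical VAU examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a man steals a bicycle near the store entrance".split()

# Factually correct, but phrased differently from the reference.
paraphrase = "someone takes a bike that is not theirs by the shop door".split()

# Factually wrong (no theft occurs), but lexically close to the reference.
wrong = "a man parks a bicycle near the store entrance".split()

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))  # near 0
print(sentence_bleu([reference], wrong, smoothing_function=smooth))       # high
```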
Related papers
- QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models [14.860588888047708]
QuantiPhy is the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness.
arXiv Detail & Related papers (2025-12-22T16:18:00Z)
- HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding [120.84817886550765]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z)
- Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models [77.96693360763925]
Video SimpleQA is the first comprehensive benchmark tailored for factuality evaluation in video contexts. Our work differs from existing video benchmarks through the following key features. Knowledge required: demanding integration of external knowledge beyond the video's explicit narrative. Short-form definitive answer: answers are crafted as unambiguous and definitively correct in a short format with minimal scoring variance.
arXiv Detail & Related papers (2025-03-24T17:46:09Z)
- HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs. HumanVBench comprises 16 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects. A comprehensive evaluation across 22 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and emotion perception.
arXiv Detail & Related papers (2024-12-23T13:45:56Z)
- Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly [12.896651217314744]
We introduce a benchmark for Exploring the Causation of Video Anomalies (ECVA). Our benchmark is meticulously designed, with each video accompanied by detailed human annotations. We propose AnomEval, a specialized evaluation metric crafted to align closely with human judgment criteria for ECVA.
arXiv Detail & Related papers (2024-12-10T04:41:44Z)
- FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning [15.363132825156477]
We introduce FIOVA, a human-centric benchmark tailored for the evaluation of large vision-language models (LVLMs). It comprises 3,002 real-world videos (about 33.6s each), each annotated independently by five annotators. We propose FIOVA-DQ, an event-level evaluation metric that incorporates cognitive weights derived from annotator consensus.
arXiv Detail & Related papers (2024-10-20T03:59:54Z)
- VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos. Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models. We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmarking task and find that most of the models encounter difficulties in effectively identifying the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z)
- Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly [29.822544507594056]
We present a benchmark for Causation Understanding of Video Anomaly (CUVA).
Each instance of the proposed benchmark involves three sets of human annotations to indicate the "what", "why" and "how" of an anomaly.
MMEval is a novel evaluation metric designed to better align with human preferences for CUVA.
arXiv Detail & Related papers (2024-04-30T20:11:49Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- Towards Robust Text-Prompted Semantic Criterion for In-the-Wild Video Quality Assessment [54.31355080688127]
We introduce a text-prompted Semantic Affinity Quality Index (SAQI) and its localized version (SAQI-Local) using Contrastive Language-Image Pre-training (CLIP).
BVQI-Local demonstrates unprecedented performance, surpassing existing zero-shot indices by at least 24% on all datasets.
We conduct comprehensive analyses to investigate different quality concerns of distinct indices, demonstrating the effectiveness and rationality of our design.
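As a rough illustration of this text-prompted affinity idea, the sketch below scores a single frame by the probability CLIP assigns to a positive quality prompt over a negative one. It is an assumption-laden approximation, not the paper's SAQI/BVQI implementation: the checkpoint, the prompt pair, and scoring a lone frame are all illustrative choices.

```python
# Minimal sketch of a text-prompted affinity score using Hugging Face CLIP.
# The checkpoint name and prompt wording are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def affinity_quality(frame: Image.Image) -> float:
    """Probability CLIP prefers the 'high quality' prompt for this frame."""
    prompts = ["a high quality photo", "a low quality photo"]
    inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, 2)
    return logits.softmax(dim=-1)[0, 0].item()

# A video-level index could then be a simple average of frame-level scores.
```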
arXiv Detail & Related papers (2023-04-28T08:06:05Z)
- Exploring Opinion-unaware Video Quality Assessment with Semantic Affinity Criterion [52.07084862209754]
We introduce an explicit semantic affinity index for opinion-unaware VQA using text prompts in the Contrastive Language-Image Pre-training (CLIP) model.
We also aggregate it with traditional low-level naturalness indices through Gaussian normalization and sigmoid rescaling strategies (a minimal sketch of this aggregation follows the entry).
The proposed Blind Unified Opinion-Unaware Video Quality Index via Semantic and Technical Metric Aggregation (BUONA-VISTA) outperforms existing opinion-unaware VQA methods by at least 20%.
arXiv Detail & Related papers (2023-02-26T08:46:07Z)
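The aggregation step described in the BUONA-VISTA entry above can be sketched as follows: Gaussian (z-score) normalization of each index across the dataset, sigmoid rescaling to a common range, then fusion. The equal-weight mean is an assumption for illustration, not necessarily the paper's exact fusion rule.

```python
# Minimal sketch: Gaussian normalization + sigmoid rescaling + aggregation.
# Equal-weight averaging over indices is an illustrative assumption.
import numpy as np

def aggregate_indices(*indices: np.ndarray) -> np.ndarray:
    """Each array in `indices` holds one raw quality index over the same N videos."""
    rescaled = []
    for x in indices:
        z = (x - x.mean()) / (x.std() + 1e-8)   # Gaussian (z-score) normalization
        rescaled.append(1.0 / (1.0 + np.exp(-z)))  # sigmoid rescaling to (0, 1)
    return np.mean(rescaled, axis=0)  # unified opinion-unaware index

# Example with one semantic affinity index and two low-level naturalness indices.
rng = np.random.default_rng(0)
semantic = rng.random(100)
naturalness_a = rng.random(100)
naturalness_b = rng.random(100)
unified = aggregate_indices(semantic, naturalness_a, naturalness_b)
```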