DUAL-VAD: Dual Benchmarks and Anomaly-Focused Sampling for Video Anomaly Detection
- URL: http://arxiv.org/abs/2509.11605v2
- Date: Tue, 16 Sep 2025 01:19:35 GMT
- Title: DUAL-VAD: Dual Benchmarks and Anomaly-Focused Sampling for Video Anomaly Detection
- Authors: Seoik Jung, Taekyung Song, Joshua Jordan Daniel, JinYoung Lee, SungJun Lee,
- Abstract summary: Video Anomaly Detection (VAD) is critical for surveillance and public safety.<n>Existing benchmarks are limited to either frame-level or video-level tasks.<n>This work introduces a softmax-based frame allocation strategy that prioritizes anomaly-dense segments while maintaining full-video coverage.
- Score: 8.294763803639391
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Anomaly Detection (VAD) is critical for surveillance and public safety. However, existing benchmarks are limited to either frame-level or video-level tasks, restricting a holistic view of model generalization. This work first introduces a softmax-based frame allocation strategy that prioritizes anomaly-dense segments while maintaining full-video coverage, enabling balanced sampling across temporal scales. Building on this process, we construct two complementary benchmarks. The image-based benchmark evaluates frame-level reasoning with representative frames, while the video-based benchmark extends to temporally localized segments and incorporates an abnormality scoring task. Experiments on UCF-Crime demonstrate improvements at both the frame and video levels, and ablation studies confirm clear advantages of anomaly-focused sampling over uniform and random baselines.
Related papers
- VIPER: Process-aware Evaluation for Generative Video Reasoning [64.86465792516658]
We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning.<n>Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit a significant outcome-hacking.
arXiv Detail & Related papers (2025-12-31T16:31:59Z) - Frame Sampling Strategies Matter: A Benchmark for small vision language models [3.719563722270237]
We propose the first frame-accurate benchmark of state-of-the-art small vision language models for video question-answering.<n>Our results confirm the suspected bias and highlight both data-specific and task-specific behaviors of SVLMs under different frame-sampling techniques.
arXiv Detail & Related papers (2025-09-18T09:18:42Z) - Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing [26.317163478761916]
Weakly-supervised audio-visual video parsing seeks to detect audible, visible, and audio-visual events without temporal annotations.<n>We propose an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks.<n>We also propose a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs.
arXiv Detail & Related papers (2025-09-17T15:38:05Z) - Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding [0.0]
We propose a framework that fuses video, image, and textcoding using GRU-based sequence encoders and cross-modal attention mechanisms.<n>Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines.
arXiv Detail & Related papers (2025-07-04T12:35:52Z) - Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learnstemporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs)
Our method achieves state-of-theart performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z) - Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z) - A Coarse-to-Fine Pseudo-Labeling (C2FPL) Framework for Unsupervised
Video Anomaly Detection [4.494911384096143]
Detection of anomalous events in videos is an important problem in applications such as surveillance.
We propose a simple-but-effective two-stage pseudo-label generation framework that produces segment-level (normal/anomaly) pseudo-labels.
The proposed coarse-to-fine pseudo-label generator employs carefully-designed hierarchical divisive clustering and statistical hypothesis testing.
arXiv Detail & Related papers (2023-10-26T17:59:19Z) - Transform-Equivariant Consistency Learning for Temporal Sentence
Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z) - Local-Global Associative Frame Assemble in Video Re-ID [57.7470971197962]
Noisy and unrepresentative frames in automatically generated object bounding boxes from video sequences cause challenges in learning discriminative representations in video re-identification (Re-ID)
Most existing methods tackle this problem by assessing the importance of video frames according to either their local part alignments or global appearance correlations separately.
In this work, we explore jointly both local alignments and global correlations with further consideration of their mutual promotion/reinforcement.
arXiv Detail & Related papers (2021-10-22T19:07:39Z) - MIST: Multiple Instance Self-Training Framework for Video Anomaly
Detection [76.80153360498797]
We develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations.
MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder.
Our method performs comparably to or even better than existing supervised and weakly supervised methods, specifically obtaining a frame-level AUC 94.83% on ShanghaiTech.
arXiv Detail & Related papers (2021-04-04T15:47:14Z) - Robust Unsupervised Video Anomaly Detection by Multi-Path Frame
Prediction [61.17654438176999]
We propose a novel and robust unsupervised video anomaly detection method by frame prediction with proper design.
Our proposed method obtains the frame-level AUROC score of 88.3% on the CUHK Avenue dataset.
arXiv Detail & Related papers (2020-11-05T11:34:12Z) - A Self-Reasoning Framework for Anomaly Detection Using Video-Level
Labels [17.615297975503648]
Alous event detection in surveillance videos is a challenging and practical research problem among image and video processing community.
We propose a weakly supervised anomaly detection framework based on deep neural networks which is trained in a self-reasoning fashion using only video-level labels.
The proposed framework has been evaluated on publicly available real-world anomaly detection datasets including UCF-crime, ShanghaiTech and Ped2.
arXiv Detail & Related papers (2020-08-27T02:14:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.