Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark
- URL: http://arxiv.org/abs/2506.01466v1
- Date: Mon, 02 Jun 2025 09:23:58 GMT
- Title: Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark
- Authors: Shuyu Yang, Yilun Wang, Yaxiong Wang, Li Zhu, Zhedong Zheng
- Abstract summary: Video anomaly retrieval aims to localize anomalous events in videos using natural language queries to facilitate public safety. Existing datasets suffer from data scarcity due to the long-tail nature of real-world anomalies, and from privacy constraints that impede large-scale collection. We introduce SVTA (Synthetic Video-Text Anomaly benchmark), the first large-scale dataset for cross-modal anomaly retrieval.
- Score: 26.948237287675116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video anomaly retrieval aims to localize anomalous events in videos using natural language queries, facilitating public safety. However, existing datasets suffer from severe limitations: (1) data scarcity due to the long-tail nature of real-world anomalies, and (2) privacy constraints that impede large-scale collection. To address both issues at once, we introduce SVTA (Synthetic Video-Text Anomaly benchmark), the first large-scale dataset for cross-modal anomaly retrieval, leveraging generative models to overcome data-availability challenges. Specifically, we collect and generate video descriptions via an off-the-shelf large language model (LLM), covering 68 anomaly categories, e.g., throwing, stealing, and shooting. These descriptions encompass both common and long-tail events. We use these texts to guide a video generative model to produce diverse, high-quality videos. The resulting SVTA comprises 41,315 videos (1.36M frames) with paired captions, covering 30 normal activities, e.g., standing, walking, and sports, and 68 anomalous events, e.g., falling, fighting, theft, explosions, and natural disasters. We evaluate three widely used video-text retrieval baselines on SVTA, revealing its challenging nature and its usefulness for assessing robust cross-modal retrieval methods. SVTA eliminates the privacy risks associated with real-world anomaly collection while maintaining realistic scenarios. The dataset demo is available at: https://svta-mm.github.io/SVTA.github.io/
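To make the evaluation protocol concrete, the following is a minimal sketch of how a text-to-video retrieval baseline is typically scored on a benchmark of paired captions and videos, reporting Recall@K over cosine similarities. The embeddings below are random stand-ins for the outputs of a pretrained video-text model; nothing here is taken from the SVTA codebase.

```python
# Minimal sketch of scoring a text-to-video retrieval baseline with Recall@K.
# The embeddings are random stand-ins for a pretrained video-text encoder.
import numpy as np

def recall_at_k(text_emb: np.ndarray, video_emb: np.ndarray, ks=(1, 5, 10)):
    """Row i of text_emb is the query whose ground-truth match is row i
    of video_emb (paired caption/video)."""
    # Cosine similarity via L2-normalized dot products.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sims = t @ v.T  # (num_texts, num_videos)
    # Rank of the ground-truth video for each text query (0 = top hit).
    order = np.argsort(-sims, axis=1)
    ranks = np.array([int(np.where(order[i] == i)[0][0]) for i in range(len(order))])
    return {f"R@{k}": float((ranks < k).mean()) for k in ks}

rng = np.random.default_rng(0)
texts = rng.normal(size=(100, 512))                 # 100 caption embeddings
videos = texts + 0.5 * rng.normal(size=(100, 512))  # noisy paired video embeddings
print(recall_at_k(texts, videos))
```

Each caption queries the full video gallery, and the rank of its paired video determines the recall metrics.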
Related papers
- PV-VTT: A Privacy-Centric Dataset for Mission-Specific Anomaly Detection and Natural Language Interpretation [5.0923114224599555]
We present PV-VTT (Privacy Violation Video To Text), a unique multimodal dataset aimed at identifying privacy violations. PV-VTT provides detailed annotations for both video and text across its scenarios. This privacy-focused approach allows researchers to use the dataset while protecting participant confidentiality.
arXiv Detail & Related papers (2024-10-30T01:02:20Z) - VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos. Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models. We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmarking task and find that most of the models encounter difficulties in effectively identifying the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z) - Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset containing 1,200 long videos, each with a high-quality summary annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z) - Dynamic Erasing Network Based on Multi-Scale Temporal Features for Weakly Supervised Video Anomaly Detection [103.92970668001277]
We propose a Dynamic Erasing Network (DE-Net) for weakly supervised video anomaly detection.
We first propose a multi-scale temporal modeling module, capable of extracting features from segments of varying lengths.
Then, we design a dynamic erasing strategy, which dynamically assesses the completeness of the detected anomalies.
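As a rough illustration of the multi-scale temporal idea (a hypothetical simplification, not the authors' DE-Net implementation), the sketch below pools per-snippet features over causal windows of several lengths, so that both short and long anomalous events leave a signature:

```python
# Hypothetical simplification of multi-scale temporal modeling: pool
# per-snippet features over causal windows of several lengths.
import numpy as np

def multi_scale_pool(feats: np.ndarray, scales=(1, 2, 4)) -> np.ndarray:
    """feats: (T, D) per-snippet features -> (T, D * len(scales))."""
    T, _ = feats.shape
    outs = []
    for s in scales:
        pooled = np.empty_like(feats)
        for t in range(T):
            lo = max(0, t - s + 1)            # causal window of length <= s
            pooled[t] = feats[lo:t + 1].mean(axis=0)
        outs.append(pooled)
    return np.concatenate(outs, axis=1)

feats = np.random.default_rng(1).normal(size=(16, 8))  # 16 snippets, 8-D features
print(multi_scale_pool(feats).shape)                   # (16, 24)
```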
arXiv Detail & Related papers (2023-12-04T09:40:11Z) - Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges [10.809558232493236]
We propose a new research direction of surveillance video-and-language understanding, and construct the first multimodal surveillance video dataset.
We manually annotate the real-world surveillance dataset UCF-Crime with fine-grained event content and timing.
We benchmark SOTA models for four multimodal tasks on this newly created dataset, which serve as new baselines for surveillance video-and-language understanding.
arXiv Detail & Related papers (2023-09-25T07:46:56Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has been paid increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos across modalities, e.g., using language queries.
We present two benchmarks, UCFCrime-AR and XDViolence-AR, constructed on top of the prevalent anomaly datasets UCF-Crime and XD-Violence.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - Anomaly Detection in Aerial Videos with Transformers [49.011385492802674]
We create a new dataset, named DroneAnomaly, for anomaly detection in aerial videos.
There are 87,488 color video frames (51,635 for training and 35,853 for testing) at a resolution of $640 \times 640$ and 30 frames per second.
We present a new baseline model, ANomaly Detection with Transformers (ANDT), which treats consecutive video frames as a sequence of tubelets.
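To make the tubelet notion concrete, here is a toy sketch (hypothetical shapes and helper names, not the authors' ANDT code) that splits a clip into non-overlapping spatio-temporal patches and flattens each into a token:

```python
# Toy sketch of tubelet tokenization: split a clip into non-overlapping
# t x p x p spatio-temporal patches and flatten each one into a token.
import numpy as np

def to_tubelets(video: np.ndarray, t: int = 2, p: int = 16) -> np.ndarray:
    """video: (T, H, W, C) -> (num_tubelets, t * p * p * C) flattened tokens."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    v = video.reshape(T // t, t, H // p, p, W // p, p, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)  # group axes by tubelet index
    return v.reshape(-1, t * p * p * C)   # one row per tubelet

clip = np.zeros((8, 640, 640, 3), dtype=np.float32)  # 8 frames at 640 x 640
print(to_tubelets(clip).shape)                       # (6400, 1536)
```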
arXiv Detail & Related papers (2022-09-25T21:24:18Z) - Anomaly Detection in Video Sequences: A Benchmark and Computational Model [25.25968958782081]
We contribute a new Large-scale Anomaly Detection (LAD) database as the benchmark for anomaly detection in video sequences.
It contains 2,000 video sequences, spanning normal and abnormal clips across 14 anomaly categories, such as crash, fire, and violence.
It provides annotation data, including video-level labels (normal/abnormal video, anomaly type) and frame-level labels (normal/abnormal frame), to facilitate anomaly detection.
We propose a multi-task deep neural network to solve anomaly detection as a fully-supervised learning problem.
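A minimal sketch of such a fully supervised multi-task objective (illustrative names and loss weighting, not the paper's implementation) combines video-level anomaly-type cross-entropy with frame-level binary cross-entropy:

```python
# Illustrative multi-task objective: video-level anomaly-type classification
# plus frame-level normal/abnormal scoring. Weighting alpha is an assumption.
import numpy as np

def multi_task_loss(type_logits, type_label, frame_scores, frame_labels, alpha=0.5):
    """type_logits: (K,) logits over K anomaly categories; frame_scores: (T,) in (0, 1)."""
    # Video-level: cross-entropy over anomaly categories (stable log-softmax).
    m = type_logits.max()
    log_probs = type_logits - m - np.log(np.exp(type_logits - m).sum())
    video_loss = -log_probs[type_label]
    # Frame-level: binary cross-entropy against per-frame labels.
    eps = 1e-7
    frame_loss = -(frame_labels * np.log(frame_scores + eps)
                   + (1 - frame_labels) * np.log(1 - frame_scores + eps)).mean()
    return video_loss + alpha * frame_loss

rng = np.random.default_rng(2)
loss = multi_task_loss(rng.normal(size=14), 3,                      # 14 LAD categories
                       rng.uniform(0.01, 0.99, size=32),            # 32 frame scores
                       (rng.uniform(size=32) > 0.5).astype(float))  # frame labels
print(float(loss))
```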
arXiv Detail & Related papers (2021-06-16T06:34:38Z) - VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with aligned subtitles as the premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
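As a toy illustration of the task setup (all components are stand-ins for the neural encoders used by actual Violin baselines), one can fuse a premise embedding built from the clip and its subtitles with a hypothesis embedding and score entailment with a linear probe:

```python
# Toy sketch of video-and-language inference: fuse a premise built from the
# clip and its subtitles with the hypothesis, then score entailment.
import numpy as np

def entailment_prob(video_emb, subtitle_emb, hypothesis_emb, w, b=0.0):
    """All embeddings are (D,); w is (3 * D,) weights of a linear probe."""
    premise = (video_emb + subtitle_emb) / 2  # naive premise fusion
    feats = np.concatenate([premise, hypothesis_emb, premise * hypothesis_emb])
    logit = feats @ w + b
    return 1.0 / (1.0 + np.exp(-logit))       # P(hypothesis entailed)

rng = np.random.default_rng(3)
D = 64
p = entailment_prob(rng.normal(size=D), rng.normal(size=D),
                    rng.normal(size=D), 0.01 * rng.normal(size=3 * D))
print(p)  # probability in (0, 1); contradiction is the complement
```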
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.