Towards Surveillance Video-and-Language Understanding: New Dataset,
Baselines, and Challenges
- URL: http://arxiv.org/abs/2309.13925v2
- Date: Mon, 4 Dec 2023 13:34:01 GMT
- Title: Towards Surveillance Video-and-Language Understanding: New Dataset,
Baselines, and Challenges
- Authors: Tongtong Yuan, Xuange Zhang, Kun Liu, Bo Liu, Chen Chen, Jian Jin,
Zhenzhen Jiao
- Abstract summary: We propose a new research direction of surveillance video-and-language understanding, and construct the first multimodal surveillance video dataset.
We manually annotate the real-world surveillance dataset UCF-Crime with fine-grained event content and timing.
We benchmark SOTA models on four multimodal tasks on this newly created dataset, establishing new baselines for surveillance video-and-language understanding.
- Score: 10.809558232493236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Surveillance videos are an essential component of daily life with various
critical applications, particularly in public security. However, current
surveillance video tasks mainly focus on classifying and localizing anomalous
events. Although existing methods achieve considerable performance, they are
limited to detecting and classifying predefined events and offer little
semantic understanding. To address this issue, we propose a new
research direction of surveillance video-and-language understanding, and
construct the first multimodal surveillance video dataset. We manually annotate
the real-world surveillance dataset UCF-Crime with fine-grained event content
and timing. Our newly annotated dataset, UCA (UCF-Crime Annotation), contains
23,542 sentences with an average length of 20 words, and its annotated videos
total 110.7 hours. Furthermore, we benchmark SOTA models on four multimodal
tasks on this newly created dataset, establishing new baselines for
surveillance video-and-language understanding. Through our experiments, we
find that mainstream models developed on existing public datasets perform
poorly on surveillance video, which demonstrates the new challenges in
surveillance video-and-language understanding. To validate the effectiveness of
our UCA, we conducted experiments on multimodal anomaly detection. The results
demonstrate that our multimodal surveillance learning can improve the
performance of conventional anomaly detection tasks. All the experiments
highlight the necessity of constructing this dataset to advance surveillance
AI. The link to our dataset is provided at:
https://xuange923.github.io/Surveillance-Video-Understanding.
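To make the annotation format concrete, the sketch below shows how UCA-style annotations (fine-grained event sentences paired with start/end timing) might be loaded for downstream tasks such as captioning or temporal grounding. This is a minimal sketch assuming a hypothetical JSON export; the layout and field names (`start`, `end`, `sentence`) are illustrative and may not match the released files, so consult the dataset page above for the actual format.

```python
import json

def load_uca_annotations(path: str) -> list[dict]:
    """Flatten per-video event annotations into (video, span, sentence) records.

    Assumes a hypothetical layout {video_id: [{"start", "end", "sentence"}, ...]};
    the real UCA release format may differ.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    records = []
    for video_id, events in data.items():
        for ev in events:
            records.append({
                "video_id": video_id,
                "start": float(ev["start"]),   # event onset, seconds
                "end": float(ev["end"]),       # event offset, seconds
                "sentence": ev["sentence"],    # fine-grained event description
            })
    return records

if __name__ == "__main__":
    records = load_uca_annotations("uca_annotations.json")
    avg_len = sum(len(r["sentence"].split()) for r in records) / len(records)
    print(f"{len(records)} annotated events, avg sentence length {avg_len:.1f} words")
```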
Related papers
- Temporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation by Object Tracking [90.81846867441993]
This paper presents the first investigation into preventing personal video data from unauthorized exploitation by deep trackers.
We propose a novel generative framework for generating Temporal Unlearnable Examples (TUEs).
Our approach achieves state-of-the-art performance in video data-privacy protection, with strong transferability across VOT models, datasets, and temporal matching tasks.
arXiv Detail & Related papers (2025-07-10T07:11:33Z) - SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models [8.402075279942256]
SurveillanceVQA-589K is the largest open-ended video question answering benchmark tailored to the surveillance domain.
The dataset comprises 589,380 QA pairs spanning 12 cognitively diverse question types.
Our benchmark provides a practical and comprehensive resource for advancing video-language understanding in safety-critical applications.
arXiv Detail & Related papers (2025-05-19T00:57:04Z) - MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation [5.0923114224599555]
Video Anomaly Detection and Video Anomaly Recognition are critically important for applications in intelligent surveillance, evidence investigation, violence alerting, etc.
These tasks face significant challenges due to the rarity of anomalies, which leads to extremely imbalanced data, and the impracticality of extensive frame-level annotation for supervised learning.
This paper introduces MissionGNN, a novel hierarchical graph neural network (GNN) based model that addresses these challenges by leveraging a state-of-the-art large language model and a comprehensive knowledge graph for efficient weakly supervised learning in VAR.
arXiv Detail & Related papers (2024-06-27T01:09:07Z) - VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos.
Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models.
We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmark task and find that most models have difficulty identifying the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z) - CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z) - Collaborative Learning of Anomalies with Privacy (CLAP) for Unsupervised Video Anomaly Detection: A New Baseline [7.917971102697765]
Unsupervised (US) video anomaly detection (VAD) in surveillance applications is gaining popularity.
In this paper, we propose a new baseline for anomaly detection capable of localizing anomalous events in complex surveillance videos in a fully unsupervised fashion.
We modify existing VAD datasets to extensively evaluate our approach as well as existing US SOTA methods on two large-scale datasets.
arXiv Detail & Related papers (2024-04-01T01:25:06Z) - Temporal Sentence Grounding in Streaming Videos [60.67022943824329]
This paper aims to tackle a novel task: Temporal Sentence Grounding in Streaming Videos (TSGSV).
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
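As an aside, the relevance-scoring task described in this entry can be illustrated with a toy sketch: frame features arrive one at a time, near-duplicate frames are dropped (a crude stand-in for a language-guided feature compressor; this sketch does not implement the paper's method), and each retained frame is scored against the query embedding. The encoders are stubbed with random vectors, and nothing here reflects the paper's TwinNet.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def stream_grounding_scores(frames, query, redundancy_thresh=0.95):
    """Score each incoming frame feature against a sentence-query embedding.

    Frames nearly identical to the last retained frame are skipped, so the
    stream is compressed before scoring. Returns (timestep, score) pairs.
    """
    last_kept, scores = None, []
    for t, feat in enumerate(frames):
        if last_kept is not None and cosine(feat, last_kept) > redundancy_thresh:
            continue  # redundant frame: drop without scoring
        last_kept = feat
        scores.append((t, cosine(feat, query)))
    return scores

# Toy usage: random features stand in for real video/text encoders.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 512))   # 100 streamed frame features
query = rng.normal(size=512)           # one sentence-query embedding
print(stream_grounding_scores(frames, query)[:3])
```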
arXiv Detail & Related papers (2023-08-14T12:30:58Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New
Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos across modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z) - A New Comprehensive Benchmark for Semi-supervised Video Anomaly
Detection and Anticipation [46.687762316415096]
We propose a new comprehensive dataset, NWPU Campus, containing 43 scenes, 28 classes of abnormal events, and 16 hours of videos.
It is the largest semi-supervised VAD dataset, with the most scenes and anomaly classes and the longest duration, and it is the only one to consider scene-dependent anomalies.
We propose a novel model capable of detecting and anticipating anomalous events simultaneously.
arXiv Detail & Related papers (2023-05-23T02:20:12Z) - CHAD: Charlotte Anomaly Dataset [2.6774008509840996]
We present the Charlotte Anomaly dataset (CHAD) for video anomaly detection.
CHAD is the first anomaly dataset to include bounding box, identity, and pose annotations for each actor.
With four camera views and over 1.15 million frames, CHAD is the largest fully annotated anomaly detection dataset.
arXiv Detail & Related papers (2022-12-19T06:05:34Z) - Anomaly detection in surveillance videos using transformer based
attention model [3.2968779106235586]
This research suggests using a weakly supervised strategy to avoid annotating anomalous segments in training videos.
The proposed framework is validated on a real-world dataset, the ShanghaiTech Campus dataset.
arXiv Detail & Related papers (2022-06-03T12:19:39Z) - Anomaly Crossing: A New Method for Video Anomaly Detection as
Cross-domain Few-shot Learning [32.0713939637202]
Video anomaly detection aims to identify abnormal events that occur in videos.
Most previous approaches learn only from normal videos using unsupervised or semi-supervised methods.
We propose a new learning paradigm by making full use of both normal and abnormal videos for video anomaly detection.
arXiv Detail & Related papers (2021-12-12T20:49:38Z) - QVHighlights: Detecting Moments and Highlights in Videos via Natural
Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
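To illustrate the three annotation types described in the QVHighlights entry above, one record might be represented as below. This is a hedged sketch: the field names and value layout are hypothetical, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class QVHighlightsEntry:
    """One annotated video, mirroring the three annotation types described above."""
    video_id: str
    query: str                           # human-written free-form NL query
    moments: list[tuple[float, float]]   # (start, end) spans relevant to the query, seconds
    saliency: list[int]                  # five-point saliency score per query-relevant clip

entry = QVHighlightsEntry(
    video_id="example_yt_id",
    query="a chef plates a dessert",
    moments=[(12.0, 34.0), (58.0, 66.0)],
    saliency=[4, 5],
)
print(entry.query, entry.moments, entry.saliency)
```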