Towards Video Anomaly Retrieval from Video Anomaly Detection: New
Benchmarks and Model
- URL: http://arxiv.org/abs/2307.12545v2
- Date: Wed, 28 Feb 2024 02:24:09 GMT
- Title: Towards Video Anomaly Retrieval from Video Anomaly Detection: New
Benchmarks and Model
- Authors: Peng Wu, Jing Liu, Xiangteng He, Yuxin Peng, Peng Wang, and Yanning
Zhang
- Abstract summary: Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos across modalities.
We present two benchmarks, UCFCrime-AR and XDViolence-AR, constructed on top of prevalent anomaly datasets.
- Score: 70.97446870672069
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video anomaly detection (VAD) has received increasing attention due
to its potential applications. Its current dominant tasks focus on detecting
anomalies online at the frame level, which can be roughly interpreted as binary
or multi-class event classification. However, such a setup, which ties
complicated anomalous events to single labels, e.g., ``vandalism'', is
superficial, since single labels are insufficient to characterize anomalous
events. In reality, users tend to search for a specific video rather than a
series of approximate videos. Therefore, retrieving anomalous events using
detailed descriptions is practical and valuable, yet little research has
focused on it. In this context, we propose a novel task called Video Anomaly
Retrieval (VAR), which aims to pragmatically retrieve relevant anomalous videos
across modalities, e.g., language descriptions and synchronized audio. Unlike
conventional video retrieval, where videos are assumed to be temporally
well-trimmed and of short duration, VAR is devised to retrieve long untrimmed
videos that may be only partially relevant to the given query. To
achieve this, we present two large-scale VAR benchmarks, UCFCrime-AR and
XDViolence-AR, constructed on top of prevalent anomaly datasets. Meanwhile, we
design a model called Anomaly-Led Alignment Network (ALAN) for VAR. In ALAN, we
propose an anomaly-led sampling to focus on key segments in long untrimmed
videos. Then, we introduce an efficient pretext task to enhance semantic
associations between fine-grained video-text representations. In addition, we
leverage two complementary alignments to further match cross-modal content.
Experimental results on the two benchmarks reveal the challenges of the VAR
task and demonstrate the advantages of our tailored method. Captions are publicly
released at https://github.com/Roc-Ng/VAR.
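The abstract describes ALAN only at a high level. As a rough, non-authoritative illustration of the retrieval setting, the minimal sketch below assumes per-segment anomaly scores from an off-the-shelf detector and pre-computed video/text embeddings: it keeps the most anomalous segments of a long untrimmed video (one plausible reading of anomaly-led sampling) and ranks videos against a query by cosine similarity. The function names, the mean pooling, and the top-k selection are illustrative assumptions, not details from the paper.

```python
import numpy as np

def anomaly_led_sample(segment_feats, anomaly_scores, k=8):
    """Keep the k segments an off-the-shelf detector scores as most anomalous.

    segment_feats: (T, D) features of T segments from a long untrimmed video.
    anomaly_scores: (T,) per-segment anomaly scores (assumed given).
    """
    top = np.argsort(anomaly_scores)[::-1][:k]   # indices of the k highest scores
    return segment_feats[np.sort(top)]           # preserve temporal order

def retrieve(query_emb, video_embs):
    """Rank videos by cosine similarity to a text (or audio) query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    return np.argsort(v @ q)[::-1]               # most relevant video first

# Toy usage: 3 untrimmed videos, each split into 32 segments of 512-d features.
rng = np.random.default_rng(0)
videos = [rng.standard_normal((32, 512)) for _ in range(3)]
scores = [rng.random(32) for _ in range(3)]      # stand-in for detector outputs
video_embs = np.stack([anomaly_led_sample(f, s).mean(axis=0)  # pool kept segments
                       for f, s in zip(videos, scores)])
query_emb = rng.standard_normal(512)             # stand-in for a text embedding
print(retrieve(query_emb, video_embs))           # ranking over the 3 videos
```

The point of the sampling step is that, in VAR, only a small portion of a long video is typically relevant to the query, so pooling over anomaly-led segments rather than the whole video is one simple way to keep the video representation from being dominated by normal content.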
Related papers
- Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM [35.06386971859359]
Holmes-VAD is a novel framework that leverages precise temporal supervision and rich multimodal instructions.
We construct the first large-scale multimodal VAD instruction-tuning benchmark, VAD-Instruct50k.
Building upon the VAD-Instruct50k dataset, we develop a customized solution for interpretable video anomaly detection.
arXiv Detail & Related papers (2024-06-18T03:19:24Z) - VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos.
Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models.
We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmark and find that most of the models have difficulty identifying the subtle anomalies effectively.
arXiv Detail & Related papers (2024-06-14T17:59:01Z) - Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z) - Anomaly detection in surveillance videos using transformer based
attention model [3.2968779106235586]
This research suggests using a weakly supervised strategy to avoid annotating anomalous segments in training videos.
The proposed framework is validated on a real-world dataset, i.e., the ShanghaiTech Campus dataset.
arXiv Detail & Related papers (2022-06-03T12:19:39Z) - Exploring Motion and Appearance Information for Temporal Sentence
Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - QVHighlights: Detecting Moments and Highlights in Videos via Natural
Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z) - Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z) - Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval targets at retrieving a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on an interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z) - A Self-Reasoning Framework for Anomaly Detection Using Video-Level
Labels [17.615297975503648]
Anomalous event detection in surveillance videos is a challenging and practical research problem in the image and video processing community.
We propose a weakly supervised anomaly detection framework based on deep neural networks which is trained in a self-reasoning fashion using only video-level labels.
The proposed framework has been evaluated on publicly available real-world anomaly detection datasets including UCF-Crime, ShanghaiTech, and Ped2.
arXiv Detail & Related papers (2020-08-27T02:14:15Z)