Video Anomaly Detection and Explanation via Large Language Models
- URL: http://arxiv.org/abs/2401.05702v1
- Date: Thu, 11 Jan 2024 07:09:44 GMT
- Authors: Hui Lv and Qianru Sun
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Video Anomaly Detection (VAD) aims to localize abnormal events on the
timeline of long-range surveillance videos. Anomaly-scoring-based methods have
prevailed for years but suffer from the high complexity of thresholding and the
low explainability of detection results. In this paper, we conduct pioneering
research on equipping the VAD framework with video-based large language models
(VLLMs), making the VAD model free from thresholds and able to explain the
reasons for the detected anomalies. We introduce a novel network module,
Long-Term Context (LTC), to mitigate the incapability of VLLMs in long-range
context modeling. We design a three-phase training method that improves the
efficiency of fine-tuning VLLMs by substantially reducing the amount of VAD
data required and lowering the costs of annotating instruction-tuning data. Our
trained model achieves the top performance on the anomaly videos of the
UCF-Crime and TAD benchmarks, with AUC improvements of +3.86% and +4.96%,
respectively. More impressively, our approach can provide textual explanations
for detected anomalies.
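The abstract does not spell out the internals of the LTC module. As one hypothetical way to realize its stated goal (giving a model that sees only short windows access to long-range evidence), the sketch below caches the most normal and most anomalous frame features observed so far, so they can be appended to the current window's features. The class name, the scoring scheme, and the cache sizes are illustrative assumptions, not the authors' design.

```python
import heapq


class LongTermContext:
    """Illustrative sketch (not the paper's design) of a long-term
    context cache for video anomaly detection.

    Keeps the k lowest-scoring ("most normal") and k highest-scoring
    ("most anomalous") frame features seen so far, so a short-window
    model can condition on long-range context.
    """

    def __init__(self, k: int = 3):
        self.k = k
        self._normal = []     # max-heap on score (stored as -score): evicts highest score
        self._anomalous = []  # min-heap on score: evicts lowest score
        self._tick = 0        # tie-breaker so heapq never compares features

    def update(self, feature, score: float) -> None:
        """Offer one frame feature with its anomaly score to both caches."""
        self._tick += 1
        # Normal cache: keep the k lowest scores.
        heapq.heappush(self._normal, (-score, self._tick, feature))
        if len(self._normal) > self.k:
            heapq.heappop(self._normal)  # drop the highest-scoring entry
        # Anomaly cache: keep the k highest scores.
        heapq.heappush(self._anomalous, (score, self._tick, feature))
        if len(self._anomalous) > self.k:
            heapq.heappop(self._anomalous)  # drop the lowest-scoring entry

    def context(self):
        """Return (normal, anomalous) features, most extreme first."""
        normal = [f for _, _, f in sorted(self._normal, reverse=True)]
        anomalous = [f for _, _, f in sorted(self._anomalous, reverse=True)]
        return normal, anomalous


ltc = LongTermContext(k=2)
for feat, score in [("f0", 0.1), ("f1", 0.9), ("f2", 0.2),
                    ("f3", 0.8), ("f4", 0.05)]:
    ltc.update(feat, score)
normal, anomalous = ltc.context()
print(normal, anomalous)  # ['f4', 'f0'] ['f1', 'f3']
```

In a full pipeline, the cached features would be concatenated with the current window's features before being fed to the VLLM, which is how the paper motivates mitigating the model's short effective context.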
Related papers
- Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z)
- VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos.
Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models.
We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmark and find that most of the models have difficulty effectively identifying the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z)
- Revisiting Catastrophic Forgetting in Large Language Model Tuning [79.70722658190097]
Catastrophic Forgetting (CF) means models forgetting previously acquired knowledge when learning new data.
This paper takes the first step to reveal the direct link between the flatness of the model loss landscape and the extent of CF in the field of large language models.
Experiments on three widely-used fine-tuning datasets, spanning different model scales, demonstrate the effectiveness of our method in alleviating CF.
arXiv Detail & Related papers (2024-06-07T11:09:13Z)
- Distilling Aggregated Knowledge for Weakly-Supervised Video Anomaly Detection [11.250490586786878]
Video anomaly detection aims to develop automated models capable of identifying abnormal events in surveillance videos.
We show that distilling knowledge from aggregated representations into a relatively simple model achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-06-05T00:44:42Z)
- Harnessing Large Language Models for Training-free Video Anomaly Detection [34.76811491190446]
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video.
Training-based methods are prone to be domain-specific, thus being costly for practical deployment.
We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
arXiv Detail & Related papers (2024-04-01T09:34:55Z)
- Open-Vocabulary Video Anomaly Detection [57.552523669351636]
Video anomaly detection (VAD) with weak supervision has achieved remarkable performance in utilizing video-level labels to discriminate whether a video frame is normal or abnormal.
Recent studies attempt to tackle a more realistic setting, open-set VAD, which aims to detect unseen anomalies given seen anomalies and normal videos.
This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD), in which we aim to leverage pre-trained large models to detect and categorize seen and unseen anomalies.
arXiv Detail & Related papers (2023-11-13T02:54:17Z)
- A Lightweight Video Anomaly Detection Model with Weak Supervision and Adaptive Instance Selection [14.089888316857426]
This paper focuses on weakly supervised video anomaly detection.
We develop a lightweight video anomaly detection model.
We show that our model achieves comparable or even superior AUC scores compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-10-09T01:23:08Z)
- Ada-VSR: Adaptive Video Super-Resolution with Meta-Learning [56.676110454594344]
Adaptive Video Super-Resolution (Ada-VSR) uses external as well as internal information, through meta-transfer learning and internal learning, respectively.
Model trained using our approach can quickly adapt to a specific video condition with only a few gradient updates, which reduces the inference time significantly.
arXiv Detail & Related papers (2021-08-05T19:59:26Z)
- Robust Unsupervised Video Anomaly Detection by Multi-Path Frame Prediction [61.17654438176999]
We propose a novel and robust unsupervised video anomaly detection method by frame prediction with proper design.
Our proposed method obtains the frame-level AUROC score of 88.3% on the CUHK Avenue dataset.
arXiv Detail & Related papers (2020-11-05T11:34:12Z)
- 3D ResNet with Ranking Loss Function for Abnormal Activity Detection in Videos [6.692686655277163]
This study is motivated by recent state-of-the-art work on abnormal activity detection.
In the absence of temporal annotations, such a model is prone to false alarms when detecting abnormalities.
In this paper, we focus on the task of minimizing the false alarm rate while performing an abnormal activity detection task.
arXiv Detail & Related papers (2020-02-04T05:32:21Z)
- Delving Deeper into the Decoder for Video Captioning [23.202746094988715]
Video captioning is an advanced multi-modal task which aims to describe a video clip using a natural language sentence.
We make a thorough investigation into the decoder and adopt three techniques to improve the performance of the model.
The improvements are demonstrated in experiments on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets.
arXiv Detail & Related papers (2020-01-16T02:18:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.