Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight
- URL: http://arxiv.org/abs/2412.18298v1
- Date: Tue, 24 Dec 2024 09:05:37 GMT
- Title: Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight
- Authors: Xi Ding, Lei Wang,
- Abstract summary: Video anomaly detection (VAD) has witnessed significant advancements through the integration of large language models (LLMs) and vision-language models (VLMs)
This paper presents an in-depth review of cutting-edge LLM-/VLM-based methods in 2024.
- Score: 2.290956583394892
- License:
- Abstract: Video anomaly detection (VAD) has witnessed significant advancements through the integration of large language models (LLMs) and vision-language models (VLMs), addressing critical challenges such as interpretability, temporal reasoning, and generalization in dynamic, open-world scenarios. This paper presents an in-depth review of cutting-edge LLM-/VLM-based methods in 2024, focusing on four key aspects: (i) enhancing interpretability through semantic insights and textual explanations, making visual anomalies more understandable; (ii) capturing intricate temporal relationships to detect and localize dynamic anomalies across video frames; (iii) enabling few-shot and zero-shot detection to minimize reliance on large, annotated datasets; and (iv) addressing open-world and class-agnostic anomalies by using semantic understanding and motion features for spatiotemporal coherence. We highlight their potential to redefine the landscape of VAD. Additionally, we explore the synergy between visual and textual modalities offered by LLMs and VLMs, highlighting their combined strengths and proposing future directions to fully exploit the potential in enhancing video anomaly detection.
Related papers
- VilBias: A Study of Bias Detection through Linguistic and Visual Cues , presenting Annotation Strategies, Evaluation, and Key Challenges [2.2751168722976587]
VLBias is a framework that leverages state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) to detect linguistic and visual biases in news content.
We present a multimodal dataset comprising textual content and corresponding images from diverse news sources.
arXiv Detail & Related papers (2024-12-22T15:05:30Z) - VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models [20.92507667350599]
We introduce a verbalized learning framework named VERA that enables vision-language models to perform video anomaly detection.
VERA decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions.
During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores.
arXiv Detail & Related papers (2024-12-02T04:10:14Z) - Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learnstemporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs)
Our method achieves state-of-theart performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z) - Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts [65.04791072532106]
We present LoCoVQA, a benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs)
LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts.
This test assesses how well VLMs can ignore irrelevant information when answering queries.
arXiv Detail & Related papers (2024-06-24T17:58:03Z) - Video Anomaly Detection in 10 Years: A Survey and Outlook [10.143205531474907]
Video anomaly detection (VAD) holds immense importance across diverse domains such as surveillance, healthcare, and environmental monitoring.
This survey explores deep learning-based VAD, expanding beyond traditional supervised training paradigms to encompass emerging weakly supervised, self-supervised, and unsupervised approaches.
arXiv Detail & Related papers (2024-05-29T17:56:31Z) - Hawk: Learning to Understand Open-World Video Anomalies [76.9631436818573]
Video Anomaly Detection (VAD) systems can autonomously monitor and identify disturbances, reducing the need for manual labor and associated costs.
We introduce Hawk, a novel framework that leverages interactive large Visual Language Models (VLM) to interpret video anomalies precisely.
We have annotated over 8,000 anomaly videos with language descriptions, enabling effective training across diverse open-world scenarios, and also created 8,000 question-answering pairs for users' open-world questions.
arXiv Detail & Related papers (2024-05-27T07:08:58Z) - VisionGPT: LLM-Assisted Real-Time Anomaly Detection for Safe Visual Navigation [3.837186701755568]
This paper explores the potential of Large Language Models in zero-shot anomaly detection for safe visual navigation.
The proposed framework can identify anomalies within camera-captured frames that include any possible obstacles, then generate concise, audio-delivered descriptions emphasizing abnormalities.
arXiv Detail & Related papers (2024-03-19T03:55:39Z) - Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit modality gap, showing discrepancy when given textual and visual inputs that correspond to the same concept.
We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z) - Sparsity-Guided Holistic Explanation for LLMs with Interpretable
Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains.
The enigmatic black-box'' nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications.
We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z) - Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.