Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection
- URL: http://arxiv.org/abs/2510.14896v1
- Date: Thu, 16 Oct 2025 17:13:33 GMT
- Title: Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection
- Authors: Furkan Mumcu, Michael J. Jones, Anoop Cherian, Yasin Yilmaz
- Abstract summary: We propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Our method focuses on extracting and interpreting object activity and interactions over time. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability.
- Score: 33.77002721234086
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Existing semi-supervised video anomaly detection (VAD) methods often struggle with detecting complex anomalies involving object interactions and generally lack explainability. To overcome these limitations, we propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Unlike previous MLLM-based approaches that make direct anomaly judgments at the frame level, our method focuses on extracting and interpreting object activity and interactions over time. By querying an MLLM with visual inputs of object pairs at different moments, we generate textual descriptions of the activity and interactions from nominal videos. These textual descriptions serve as a high-level representation of the activity and interactions of objects in a video. They are used to detect anomalies during test time by comparing them to textual descriptions found in nominal training videos. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability. Extensive experiments on benchmark datasets demonstrate that our method not only detects complex interaction-based anomalies effectively but also achieves state-of-the-art performance on datasets without interaction anomalies.
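The test-time comparison the abstract describes can be sketched as a nearest-neighbor lookup over textual descriptions collected from nominal training videos. The snippet below is a hypothetical illustration, not the paper's implementation: `embed` is a toy bag-of-words encoder standing in for whatever text representation the authors actually use, and the example descriptions are invented.

```python
import math

def embed(text, vocab):
    """Toy bag-of-words embedding: unit-normalized term-count vector."""
    words = text.lower().split()
    vec = [words.count(w) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def anomaly_score(test_desc, nominal_descs):
    """1 - max cosine similarity between the test description and any
    nominal (training) description; higher means more anomalous."""
    vocab = sorted({w for d in nominal_descs + [test_desc]
                    for w in d.lower().split()})
    test_vec = embed(test_desc, vocab)
    best = max(sum(a * b for a, b in zip(embed(d, vocab), test_vec))
               for d in nominal_descs)
    return 1.0 - best

# Descriptions of object activity, as an MLLM might produce them.
nominal = [
    "a person walks along the sidewalk",
    "a person rides a bicycle on the path",
]
print(anomaly_score("a person walks along the sidewalk", nominal))
print(anomaly_score("a car drives onto the sidewalk toward a person", nominal))
```

A real system would replace the bag-of-words encoder with a learned sentence embedding and threshold the resulting score; the matched nominal description itself then serves as the explanation for why a test clip was (or was not) flagged.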
Related papers
- Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection [52.5174167737992]
Video anomaly detection (VAD) aims to identify abnormal events in videos. We propose SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our method achieves state-of-the-art performance among tuning-free approaches, requiring only 1% of training data.
arXiv Detail & Related papers (2026-02-27T13:48:50Z)
- VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models [29.213430569936943]
We propose VADER, an LLM-driven framework for Video Anomaly unDErstanding. VADER integrates object features with visual cues to enhance anomaly comprehension from video. Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks.
arXiv Detail & Related papers (2025-11-10T16:56:11Z)
- IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection [70.02774285130238]
This paper explores the combination of rich text semantics with both image-level and pixel-level information from images. We propose IAD-GPT, a novel paradigm based on MLLMs for Industrial Anomaly Detection. Experiments on the MVTec-AD and VisA datasets demonstrate our state-of-the-art performance.
arXiv Detail & Related papers (2025-10-16T02:48:05Z)
- Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z)
- Aligning Effective Tokens with Video Anomaly in Large Language Models [52.620554265703916]
We propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules. We construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs.
arXiv Detail & Related papers (2025-08-08T14:30:05Z)
- HiProbe-VAD: Video Anomaly Detection via Hidden States Probing in Tuning-Free Multimodal LLMs [8.18063726177317]
Video Anomaly Detection (VAD) aims to identify and locate deviations from normal patterns in video sequences. We propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning.
arXiv Detail & Related papers (2025-07-23T10:41:46Z)
- EventVAD: Training-Free Event-Aware Video Anomaly Detection [19.714436150837148]
EventVAD is an event-aware video anomaly detection framework. It combines tailored dynamic graph architectures and multimodal-event reasoning. It achieves state-of-the-art (SOTA) performance in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.
arXiv Detail & Related papers (2025-04-17T16:59:04Z)
- Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation [61.64052577026623]
Real-world multi-view datasets are often heterogeneous and imperfect. We propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment. Our RML is self-supervised and can also be applied to downstream tasks as a regularization.
arXiv Detail & Related papers (2025-03-06T07:01:08Z)
- Large Models in Dialogue for Active Perception and Anomaly Detection [35.16837804526144]
We propose a framework to actively collect information and perform anomaly detection in novel scenes. Two deep learning models engage in a dialogue to actively control a drone, increasing perception and anomaly detection accuracy. In addition to information gathering, our approach is utilized for anomaly detection, and our results demonstrate the proposed method's effectiveness.
arXiv Detail & Related papers (2025-01-27T18:38:36Z)
- Interactive Masked Image Modeling for Multimodal Object Detection in Remote Sensing [2.0528748158119434]
Multimodal learning can be used to integrate features from different data modalities, thereby improving detection accuracy.
In this paper, we propose to use Masked Image Modeling (MIM) as a pre-training technique, leveraging self-supervised learning on unlabeled data.
To address this, we propose a new interactive MIM method that can establish interactions between different tokens, which is particularly beneficial for object detection in remote sensing.
arXiv Detail & Related papers (2024-09-13T14:50:50Z)
- VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos. Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models. We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmarking task and find that most of the models encounter difficulties in effectively identifying the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.