Knowledge-Guided Textual Reasoning for Explainable Video Anomaly Detection via LLMs
- URL: http://arxiv.org/abs/2511.07429v1
- Date: Thu, 30 Oct 2025 01:18:55 GMT
- Title: Knowledge-Guided Textual Reasoning for Explainable Video Anomaly Detection via LLMs
- Authors: Hari Lee,
- Abstract summary: We introduce Text-based Explainable Video Anomaly Detection (TbVAD), a language-driven framework for weakly supervised video anomaly detection. TbVAD represents video semantics through language, enabling interpretable and knowledge-grounded reasoning. We evaluate TbVAD on two public benchmarks, UCF-Crime and XD-Violence, demonstrating that textual knowledge reasoning provides interpretable and reliable anomaly detection.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce Text-based Explainable Video Anomaly Detection (TbVAD), a language-driven framework for weakly supervised video anomaly detection that performs anomaly detection and explanation entirely within the textual domain. Unlike conventional WSVAD models that rely on explicit visual features, TbVAD represents video semantics through language, enabling interpretable and knowledge-grounded reasoning. The framework operates in three stages: (1) transforming video content into fine-grained captions using a vision-language model, (2) constructing structured knowledge by organizing the captions into four semantic slots (action, object, context, environment), and (3) generating slot-wise explanations that reveal which semantic factors contribute most to the anomaly decision. We evaluate TbVAD on two public benchmarks, UCF-Crime and XD-Violence, demonstrating that textual knowledge reasoning provides interpretable and reliable anomaly detection for real-world surveillance scenarios.
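To make the three-stage pipeline concrete, here is a minimal Python sketch of the flow the abstract describes: captions for a video segment are organized into the four semantic slots (action, object, context, environment) and then scored slot by slot. This is an illustration under stated assumptions, not the authors' implementation: the vision-language captioner and the LLM reasoner are stubbed out, and the names (`KnowledgeSlots`, `build_knowledge`, `explain_segment`, `SLOT_KEYWORDS`) as well as the keyword-based slot filling and the per-slot contribution score are hypothetical placeholders.

```python
"""Minimal sketch of the TbVAD three-stage pipeline described in the abstract.

Illustrative outline only. The paper uses a vision-language model for Stage 1
captioning and an LLM for slot construction and slot-wise reasoning; both are
replaced here by simple stand-ins (assumed, not from the paper).
"""
from dataclasses import dataclass, field

# Stage 2 slots named in the abstract: action, object, context, environment.
SLOT_NAMES = ("action", "object", "context", "environment")

# Hypothetical keyword cues standing in for LLM-driven knowledge construction.
SLOT_KEYWORDS = {
    "action": {"running", "fighting", "breaking", "falling"},
    "object": {"knife", "car", "bag", "window"},
    "context": {"crowd", "alone", "night", "chase"},
    "environment": {"street", "store", "parking", "platform"},
}


@dataclass
class KnowledgeSlots:
    """Structured textual knowledge for one video segment."""
    slots: dict = field(default_factory=lambda: {name: [] for name in SLOT_NAMES})

    def add(self, slot: str, phrase: str) -> None:
        # Avoid duplicate evidence entries for the same slot.
        if phrase not in self.slots[slot]:
            self.slots[slot].append(phrase)


def build_knowledge(captions: list[str]) -> KnowledgeSlots:
    """Stage 2: organize fine-grained captions into the four semantic slots.

    A real system would prompt an LLM; a keyword match illustrates the idea.
    """
    knowledge = KnowledgeSlots()
    for caption in captions:
        for word in caption.lower().split():
            for slot, cues in SLOT_KEYWORDS.items():
                if word.strip(".,") in cues:
                    knowledge.add(slot, caption)
    return knowledge


def explain_segment(captions: list[str]) -> dict:
    """Stages 2-3: build slot knowledge, then score and explain each slot.

    The per-slot score (fraction of captions that filled the slot) is a
    placeholder for the LLM's anomaly reasoning over the slots.
    """
    knowledge = build_knowledge(captions)
    report = {}
    for slot, phrases in knowledge.slots.items():
        score = len(phrases) / max(len(captions), 1)
        report[slot] = {"evidence": phrases, "contribution": round(score, 2)}
    return report


if __name__ == "__main__":
    # Stage 1 (VLM captioning) is assumed to have produced these example captions.
    segment_captions = [
        "A man is breaking a store window at night.",
        "People are running away from the scene.",
    ]
    for slot, info in explain_segment(segment_captions).items():
        print(slot, info)
```

Running the script prints, for each of the four slots, the supporting captions and a rough contribution score, mirroring the kind of slot-wise explanation the abstract describes.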
Related papers
- In-Video Instructions: Visual Signals as Generative Control [79.44662698914401]
We investigate whether generative video models can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions. In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. Experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions.
arXiv Detail & Related papers (2025-11-24T18:38:45Z)
- From Prediction to Explanation: Multimodal, Explainable, and Interactive Deepfake Detection Framework for Non-Expert Users [21.627851460651968]
We present DF-P2E (Deepfake: Prediction to Explanation), a novel framework that integrates visual, semantic, and narrative layers of explanation to make deepfake detection interpretable and accessible. We instantiate and evaluate the framework on the DF40 benchmark, the most diverse deepfake dataset to date. Experiments demonstrate that our system achieves competitive detection performance while providing high-quality explanations aligned with Grad-CAM activations.
arXiv Detail & Related papers (2025-08-11T03:55:47Z)
- VidText: Towards Comprehensive Evaluation for Video Text Understanding [56.121054697977115]
VidText is a benchmark for comprehensive and in-depth evaluation of video text understanding. It covers a wide range of real-world scenarios and supports multilingual content. It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks.
arXiv Detail & Related papers (2025-05-28T19:39:35Z)
- Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection [18.125287697902813]
Current Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in understanding multimodal data. We present a novel framework that unlocks LVLMs' potential capabilities for deepfake detection.
arXiv Detail & Related papers (2025-03-19T03:20:03Z)
- Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions [3.9633773442108873]
We propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, referred to as the narration. NarVid exploits the narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, and 3) a dual-modal matching score that combines query-video similarity and query-narration similarity.
arXiv Detail & Related papers (2025-03-07T07:15:06Z)
- Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight [2.290956583394892]
Video anomaly detection (VAD) has witnessed significant advancements through the integration of large language models (LLMs) and vision-language models (VLMs). This paper presents an in-depth review of cutting-edge LLM-/VLM-based methods in 2024.
arXiv Detail & Related papers (2024-12-24T09:05:37Z)
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
- Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z)
- LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [58.75140338866403]
DVDet is a Descriptor-Enhanced Open Vocabulary Detector.
It transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training.
Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.
arXiv Detail & Related papers (2024-02-07T07:26:49Z)
- Hierarchical Modular Network for Video Captioning [162.70349114104107]
We propose a hierarchical modular network to bridge video representations and linguistic semantics from three levels before generating captions.
The proposed method performs favorably against state-of-the-art models on two widely used benchmarks, reaching CIDEr scores of 104.0% on MSVD and 51.5% on MSR-VTT.
arXiv Detail & Related papers (2021-11-24T13:07:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.