Related papers: Harnessing Large Language Models for Training-free Video Anomaly Detection

Harnessing Large Language Models for Training-free Video Anomaly Detection

URL: http://arxiv.org/abs/2404.01014v1
Date: Mon, 1 Apr 2024 09:34:55 GMT
Title: Harnessing Large Language Models for Training-free Video Anomaly Detection
Authors: Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, Elisa Ricci,
Abstract summary: Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Training-based methods are prone to be domain-specific, thus being costly for practical deployment. We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
Score: 34.76811491190446
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods are prone to be domain-specific, thus being costly for practical deployment as any domain change will involve data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. With the textual scene description, we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.

Related papers

Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection [52.5174167737992]
Video anomaly detection (VAD) aims to identify abnormal events in videos.<n>We propose SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations.<n>Our method achieves state-of-the-art performance among tuning-free approaches requiring only 1% of training data.
arXiv Detail & Related papers (2026-02-27T13:48:50Z)
WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens [69.97021957331326]
We propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization.<n>We also introduce a VAE branch with linear projection to recover fine-grained image details.
arXiv Detail & Related papers (2025-12-02T09:02:20Z)
VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion [15.486565360380203]
ours is a novel unsupervised multi-class visual anomaly detection framework.<n>It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection.<n>Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset.
arXiv Detail & Related papers (2025-11-11T12:37:38Z)
Rethinking Visual Intelligence: Insights from Video Pretraining [75.32388528274224]
Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems.<n>We investigate Video Diffusion Models (VDMs) as a promising direction for bridging the gap.
arXiv Detail & Related papers (2025-10-28T14:12:11Z)
MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly Detection [28.5803063507761]
Video Anomaly Detection (VAD) aims to locate unusual activities or behaviors within videos.<n>Online VAD has seldom received attention due to real-time constraints and computational intensity.<n>We introduce a novel Memory-based online scoring queue scheme for Training-free VAD (MoniTor)
arXiv Detail & Related papers (2025-10-24T13:28:29Z)
Enrich and Detect: Video Temporal Grounding with Multimodal LLMs [60.224522472631776]
We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models.<n>Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video.<n>We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings.
arXiv Detail & Related papers (2025-10-19T22:12:45Z)
HiProbe-VAD: Video Anomaly Detection via Hidden States Probing in Tuning-Free Multimodal LLMs [8.18063726177317]
Video Anomaly Detection (VAD) aims to identify and locate deviations from normal patterns in video sequences.<n>We propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning.
arXiv Detail & Related papers (2025-07-23T10:41:46Z)
ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding.<n>ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error.<n>ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
EventVAD: Training-Free Event-Aware Video Anomaly Detection [19.714436150837148]
EventVAD is an event-aware video anomaly detection framework. It combines tailored dynamic graph architectures and multimodal-event reasoning. It achieves state-of-the-art (SOTA) in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.
arXiv Detail & Related papers (2025-04-17T16:59:04Z)
Learning from Streaming Video with Orthogonal Gradients [62.51504086522027]
We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch. We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks.
arXiv Detail & Related papers (2025-04-02T17:59:57Z)
AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM [1.7051307941715268]
Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision. Existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments. This study proposes customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model.
arXiv Detail & Related papers (2025-03-06T14:52:34Z)
Efficient Transfer Learning for Video-language Foundation Models [13.166348605993292]
We propose a simple yet effective Multi-modal Spatio-supervised (MSTA) to improve the alignment between representations in the text and vision branches. We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-valiant, and fully-language learning.
arXiv Detail & Related papers (2024-11-18T01:25:58Z)
Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learnstemporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs) Our method achieves state-of-theart performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z)
Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary-temporal action detection (OV-STAD) is an important fine-grained video understanding task. OV-STAD requires training a model on a limited set of base classes with box and label supervision. To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
Test-Time Zero-Shot Temporal Action Localization [58.84919541314969]
ZS-TAL seeks to identify and locate actions in untrimmed videos unseen during training. Training-based ZS-TAL approaches assume the availability of labeled data for supervised learning. We introduce a novel method that performs Test-Time adaptation for Temporal Action localization (T3AL)
arXiv Detail & Related papers (2024-04-08T11:54:49Z)
Video Anomaly Detection and Explanation via Large Language Models [34.52845566893497]
Video Anomaly Detection (VAD) aims to localize abnormal events on the timeline of long-range surveillance videos. In this paper, we conduct pioneer research on equipping video-based large language models (VLLMs) in the framework of VAD. We introduce a novel network module Long-Term Context (LTC) to mitigate the incapability of VLLMs in long-range context modeling.
arXiv Detail & Related papers (2024-01-11T07:09:44Z)
Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame. ATM outperforms strong video pre-training baselines by 80% on average. We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks. We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling. Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models [67.31684040281465]
We present textbfMOV, a simple yet effective method for textbfMultimodal textbfOpen-textbfVocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram.
arXiv Detail & Related papers (2022-07-15T17:59:11Z)
Scanflow: A multi-graph framework for Machine Learning workflow management, supervision, and debugging [0.0]
We propose a novel containerized directed graph framework to support end-to-end Machine Learning workflow management. The framework allows defining and deploying ML in containers, tracking their metadata, checking their behavior in production, and improving the models by using both learned and human-provided knowledge.
arXiv Detail & Related papers (2021-11-04T17:01:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.