Related papers: Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection

URL: http://arxiv.org/abs/2412.20455v1
Date: Sun, 29 Dec 2024 12:46:57 GMT
Title: Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection
Authors: Ayush Ghadiya, Purbayan Kar, Vishal Chudasama, Pankaj Wasnik
Abstract summary: Weakly supervised video anomaly detection (WS-VAD) has emerged as a contemporary research direction. We propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity. We show that the proposed model achieves state-of-the-art results on benchmark datasets of violence and nudity detection.
Score: 2.749898166276854
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, weakly supervised video anomaly detection (WS-VAD) has emerged as a contemporary research direction to identify anomaly events like violence and nudity in videos using only video-level labels. However, this task has substantial challenges, including addressing imbalanced modality information and consistently distinguishing between normal and abnormal features. In this paper, we address these challenges and propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity. Within the proposed framework, we introduce a new fusion mechanism known as the Cross-modal Fusion Adapter (CFA), which dynamically selects and enhances highly relevant audio-visual features in relation to the visual modality. Additionally, we introduce a Hyperbolic Lorentzian Graph Attention (HLGAtt) to effectively capture the hierarchical relationships between normal and abnormal representations, thereby enhancing feature separation accuracy. Through extensive experiments, we demonstrate that the proposed model achieves state-of-the-art results on benchmark datasets of violence and nudity detection.

Related papers

AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation [5.0923114224599555]
This paper introduces MissionGNN, a novel hierarchical graph neural network (GNN)-based model. Our approach circumvents the limitations of previous methods by avoiding heavy gradient computations on large multimodal models. Our model provides a practical and efficient solution for real-time video analysis without the constraints of previous segmentation-based or multimodal approaches.
arXiv Detail & Related papers (2024-06-27T01:09:07Z)
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos. Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models. We evaluate nine existing Video-LMMs, both open and closed sources, on this benchmarking task and find that most of the models encounter difficulties in effectively identifying the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z)
Graph-Jigsaw Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection [7.127829790714167]
Skeleton-based video anomaly detection (SVAD) is a crucial task in computer vision. This paper introduces a novel, practical and lightweight framework, namely Graph-Jigsaw Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection (GiCiSAD). Experiments on four widely used skeleton-based video datasets show that GiCiSAD outperforms existing methods with significantly fewer training parameters.
arXiv Detail & Related papers (2024-03-18T18:42:32Z)
Dynamic Erasing Network Based on Multi-Scale Temporal Features for Weakly Supervised Video Anomaly Detection [103.92970668001277]
We propose a Dynamic Erasing Network (DE-Net) for weakly supervised video anomaly detection. We first propose a multi-scale temporal modeling module, capable of extracting features from segments of varying lengths. Then, we design a dynamic erasing strategy, which dynamically assesses the completeness of the detected anomalies.
arXiv Detail & Related papers (2023-12-04T09:40:11Z)
BatchNorm-based Weakly Supervised Video Anomaly Detection [117.11382325721016]
In weakly supervised video anomaly detection, temporal features of abnormal events often exhibit outlier characteristics. We propose a novel method, BN-WVAD, which incorporates BatchNorm into WVAD. The proposed BN-WVAD model demonstrates state-of-the-art performance on UCF-Crime with an AUC of 87.24%, and XD-Violence, where AP reaches up to 84.93%.
arXiv Detail & Related papers (2023-11-26T17:47:57Z)
Open-Vocabulary Video Anomaly Detection [57.552523669351636]
Video anomaly detection (VAD) with weak supervision has achieved remarkable performance in utilizing video-level labels to discriminate whether a video frame is normal or abnormal. Recent studies attempt to tackle a more realistic setting, open-set VAD, which aims to detect unseen anomalies given seen anomalies and normal videos. This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD), in which we aim to leverage pre-trained large models to detect and categorize seen and unseen anomalies.
arXiv Detail & Related papers (2023-11-13T02:54:17Z)
MGFN: Magnitude-Contrastive Glance-and-Focus Network for Weakly-Supervised Video Anomaly Detection [39.923871347007875]
We propose a novel glance and focus network to integrate spatial-temporal information for accurate anomaly detection. Existing approaches that use feature magnitudes to represent the degree of anomalies typically ignore the effects of scene variations. We propose the Feature Amplification Mechanism and a Magnitude Contrastive Loss to enhance the discriminativeness of feature magnitudes for detecting anomalies.
arXiv Detail & Related papers (2022-11-28T07:10:36Z)
AntPivot: Livestream Highlight Detection via Hierarchical Attention Mechanism [64.70568612993416]
We formulate a new task, Livestream Highlight Detection, discuss and analyze the difficulties listed above, and propose a novel architecture, AntPivot, to solve this problem. We construct a fully-annotated dataset, AntHighlight, to instantiate this task and evaluate the performance of our model.
arXiv Detail & Related papers (2022-06-10T05:58:11Z)
A Video Anomaly Detection Framework based on Appearance-Motion Semantics Representation Consistency [18.06814233420315]
We propose a framework that uses normal data's appearance and motion semantic representation consistency to handle anomaly detection. We design a two-stream encoder to encode the appearance and motion information representations of normal samples. Lower consistency of appearance and motion features of anomalous samples can be used to generate predicted frames with larger reconstruction error.
arXiv Detail & Related papers (2022-04-08T15:59:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.