Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance
- URL: http://arxiv.org/abs/2602.10549v1
- Date: Wed, 11 Feb 2026 05:44:30 GMT
- Title: Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance
- Authors: Shengyang Sun, Jiashen Hua, Junyi Feng, Xiaojin Gong
- Abstract summary: Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging because general-purpose language models fail to capture anomaly-specific nuances. Furthermore, multimodal fusion often suffers from redundancy and imbalance.
- Score: 10.079930398169205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.
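The abstract describes the fusion module only at a high level. As a rough illustrative sketch of the bottleneck-token idea (not the authors' code; shapes, token counts, and the single-head attention are all assumptions), cross-modal exchange through a small set of compressed tokens can be written in plain NumPy:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no masking)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def bottleneck_fuse(video_feats, text_feats, bottleneck):
    """One round of bottleneck fusion: the compressed bottleneck tokens
    gather information from each modality, then each modality reads the
    fused bottleneck back. Only the few bottleneck tokens cross modality
    boundaries, which is what limits redundancy in this scheme."""
    # Bottleneck tokens collect evidence from both modalities.
    b = attention(bottleneck, video_feats, video_feats)
    b = b + attention(bottleneck, text_feats, text_feats)
    # Each modality is updated from the compressed fused tokens only.
    video_out = video_feats + attention(video_feats, b, b)
    text_out = text_feats + attention(text_feats, b, b)
    return video_out, text_out, b

rng = np.random.default_rng(0)
v = rng.normal(size=(32, 64))   # 32 video-snippet features (hypothetical)
t = rng.normal(size=(10, 64))   # 10 text features (hypothetical)
z = rng.normal(size=(4, 64))    # 4 compressed bottleneck tokens
v2, t2, z2 = bottleneck_fuse(v, t, z)
print(v2.shape, t2.shape, z2.shape)  # (32, 64) (10, 64) (4, 64)
```

Stacking this round at several temporal scales would approximate the paper's "multi-scale, progressive" integration; the key design point is that the two modalities never attend to each other directly.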
Related papers
- Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective [15.313681588364242]
We introduce a novel denoising framework called IB-IUMAD, which exploits the complementary benefits of the Mamba decoder and the information bottleneck fusion module. A series of theoretical analyses and experiments on the MVTec 3D-AD and Eyecandies datasets demonstrates the effectiveness and competitive performance of IB-IUMAD.
arXiv Detail & Related papers (2026-03-03T05:58:35Z) - PromptMAD: Cross-Modal Prompting for Multi-Class Visual Anomaly Localization [9.018570847586878]
We propose PromptMAD, a cross-modal prompting framework for unsupervised visual anomaly detection and localization. Our method enriches visual reconstruction with semantic context, improving the detection of subtle and textural anomalies. Our architecture also includes a supervised segmentor that fuses multi-scale convolutional features with Transformer-based spatial attention.
arXiv Detail & Related papers (2026-01-30T03:04:06Z) - Multimodal Rumor Detection Enhanced by External Evidence and Forgery Features [21.522558828688343]
Social media increasingly disseminates information through mixed image-text posts. Deep semantic mismatch rumors pose particular challenges and threaten online public opinion. Existing multimodal rumor detection methods suffer from limited feature extraction, noisy alignment, and inflexible fusion strategies. We propose a multimodal rumor detection model enhanced with external evidence and forgery features.
arXiv Detail & Related papers (2026-01-21T12:53:18Z) - GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection [63.16754542429089]
We propose Grained Multi-modal Feature for Video Anomaly Detection (GMFVAD). We generate a finer-grained multi-modal feature based on the video snippet, which summarizes its main content. Experiments show that GMFVAD achieves state-of-the-art performance on four mainstream datasets.
arXiv Detail & Related papers (2025-10-23T06:52:53Z) - Beyond Simple Fusion: Adaptive Gated Fusion for Robust Multimodal Sentiment Analysis [27.11612547025828]
We introduce the Adaptive Gated Fusion Network (AGFN), which adaptively adjusts feature weights based on information entropy and modality importance. Experiments on CMU-MOSI and CMU-MOSEI show that AGFN significantly outperforms strong baselines in accuracy, effectively discerning subtle emotions with robust performance.
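The entropy-based weighting is stated only as an idea here; a hypothetical sketch of gating two modality features by the entropy of their unimodal predictions might look like the following (the exponential confidence mapping and two-class heads are illustrative assumptions, not the published AGFN architecture):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector (natural log)."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p))

def adaptive_gate(feat_a, probs_a, feat_b, probs_b):
    """Hypothetical adaptive gated fusion: the modality whose unimodal
    prediction is more confident (lower entropy) receives the larger
    fusion weight. Illustration of the stated idea only."""
    conf_a = np.exp(-entropy(probs_a))   # confidence in (0, 1]
    conf_b = np.exp(-entropy(probs_b))
    w_a = conf_a / (conf_a + conf_b)
    return w_a * feat_a + (1.0 - w_a) * feat_b

# A confident text head versus an uncertain audio head.
text_feat = np.ones(8)
audio_feat = np.zeros(8)
fused = adaptive_gate(text_feat, np.array([0.9, 0.1]),
                      audio_feat, np.array([0.5, 0.5]))
print(fused[0])  # > 0.5: the confident modality dominates the mixture
```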
arXiv Detail & Related papers (2025-10-02T05:05:41Z) - Diversity Boosts AI-Generated Text Detection [51.56484100374058]
DivEye is a novel framework that captures how unpredictability fluctuates across a text using surprisal-based features. Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines.
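DivEye's exact feature set is not given in this summary. As a toy illustration, assuming per-token probabilities from some language model are available, surprisal-fluctuation statistics of the kind such detectors rely on can be computed with the standard library alone:

```python
import math
import statistics

def surprisal_features(token_probs):
    """Toy diversity features over per-token probabilities: mean,
    standard deviation, and mean absolute change of surprisal
    (-log p). The intuition behind surprisal-based detectors is that
    human text tends to fluctuate more than model-generated text."""
    s = [-math.log(p) for p in token_probs]
    deltas = [abs(b - a) for a, b in zip(s, s[1:])]
    return {
        "mean_surprisal": statistics.fmean(s),
        "std_surprisal": statistics.pstdev(s),
        "mean_abs_delta": statistics.fmean(deltas),
    }

bursty = [0.9, 0.05, 0.8, 0.02, 0.7]   # fluctuating (human-like) probs
flat = [0.4, 0.38, 0.41, 0.39, 0.4]    # uniform (model-like) probs
f_bursty = surprisal_features(bursty)
f_flat = surprisal_features(flat)
print(f_bursty["std_surprisal"] > f_flat["std_surprisal"])  # True
```

A real detector would feed such features (from an actual LM's token probabilities) into a classifier; the two hard-coded probability lists here exist only to show the contrast.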
arXiv Detail & Related papers (2025-09-23T10:21:22Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection [9.145305176998447]
Weakly supervised multimodal violence detection aims to learn a violence detection model by leveraging multiple modalities.
We propose a new weakly supervised MVD method that explicitly addresses the challenges of information redundancy, modality imbalance, and modality asynchrony.
Experiments on the largest-scale XD-Violence dataset demonstrate that the proposed method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-05-08T15:27:08Z) - Text2Data: Low-Resource Data Generation with Textual Control [100.5970757736845]
Text2Data is a novel approach that utilizes unlabeled data to understand the underlying data distribution. It undergoes fine-tuning via a novel constraint-optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - A novel multimodal dynamic fusion network for disfluency detection in spoken utterances [43.79216238760557]
We propose a novel multimodal architecture for disfluency detection from individual utterances.
Our architecture leverages a multimodal dynamic fusion network that adds minimal parameters over an existing text encoder.
We show that our proposed model achieves state-of-the-art results on the widely used English Switchboard corpus for disfluency detection.
arXiv Detail & Related papers (2022-11-27T01:54:22Z) - Multi-Fact Correction in Abstractive Text Summarization [98.27031108197944]
Span-Fact is a suite of two factual correction models that leverages knowledge learned from question answering models to make corrections in system-generated summaries via span selection.
Our models employ single or multi-masking strategies to either iteratively or auto-regressively replace entities in order to ensure semantic consistency w.r.t. the source text.
Experiments show that our models significantly boost the factual consistency of system-generated summaries without sacrificing summary quality in terms of both automatic metrics and human evaluation.
arXiv Detail & Related papers (2020-10-06T02:51:02Z)
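Span-Fact's span selection is performed by trained question-answering models. As a very rough stand-in for that step, the sketch below masks a named entity in a summary and replaces it with an entity drawn from the source document; the frequency heuristic and the regex-based entity finder are purely illustrative substitutes for the QA model:

```python
import re
from collections import Counter

def correct_entity(summary, source, entity):
    """Toy span correction: mask one entity in the summary and fill the
    mask with the most frequent capitalized token from the source.
    Stand-in for Span-Fact's QA-based span selection; the frequency
    heuristic is illustrative only."""
    masked = summary.replace(entity, "[MASK]")
    candidates = Counter(re.findall(r"\b[A-Z][a-z]+\b", source))
    best, _count = candidates.most_common(1)[0]
    return masked.replace("[MASK]", best)

source = "Paris hosted the summit. Paris officials confirmed the deal."
summary = "London hosted the summit."
result = correct_entity(summary, source, "London")
print(result)  # Paris hosted the summit.
```

The real system replaces entities iteratively or auto-regressively under single- or multi-masking, so that each filled span is conditioned on the previously corrected ones.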
This list is automatically generated from the titles and abstracts of the papers on this site.