Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance
- URL: http://arxiv.org/abs/2602.10549v1
- Date: Wed, 11 Feb 2026 05:44:30 GMT
- Title: Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance
- Authors: Shengyang Sun, Jiashen Hua, Junyi Feng, Xiaojin Gong
- Abstract summary: Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging because general-purpose language models fail to capture anomaly-specific nuances. Furthermore, multimodal fusion often suffers from redundancy and imbalance.
- Score: 10.079930398169205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.
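The abstract describes the fusion module only at a high level. As a rough illustrative sketch of the bottleneck-token idea (not the authors' code; shapes, token counts, and the single-head attention are all assumptions), cross-modal exchange through a small set of compressed tokens can be written in plain NumPy:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no masking)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def bottleneck_fuse(video_feats, text_feats, bottleneck):
    """One round of bottleneck fusion: the compressed bottleneck tokens
    gather information from each modality, then each modality reads the
    fused bottleneck back. Only the few bottleneck tokens cross modality
    boundaries, which is what limits redundancy in this scheme."""
    # Bottleneck tokens collect evidence from both modalities.
    b = attention(bottleneck, video_feats, video_feats)
    b = b + attention(bottleneck, text_feats, text_feats)
    # Each modality is updated from the compressed fused tokens only.
    video_out = video_feats + attention(video_feats, b, b)
    text_out = text_feats + attention(text_feats, b, b)
    return video_out, text_out, b

rng = np.random.default_rng(0)
v = rng.normal(size=(32, 64))   # 32 video-snippet features (hypothetical)
t = rng.normal(size=(10, 64))   # 10 text features (hypothetical)
z = rng.normal(size=(4, 64))    # 4 compressed bottleneck tokens
v2, t2, z2 = bottleneck_fuse(v, t, z)
print(v2.shape, t2.shape, z2.shape)  # (32, 64) (10, 64) (4, 64)
```

Stacking this round at several temporal scales would approximate the paper's "multi-scale, progressive" integration; the key design point is that the two modalities never attend to each other directly.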
Related papers
- Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective [15.313681588364242]
We introduce a novel denoising framework called IB-IUMAD, which exploits the complementary benefits of the Mamba decoder and the information bottleneck fusion module. A series of theoretical analyses and experiments on the MVTec 3D-AD and Eyecandies datasets demonstrates the effectiveness and competitive performance of IB-IUMAD.
arXiv Detail & Related papers (2026-03-03T05:58:35Z) - PromptMAD: Cross-Modal Prompting for Multi-Class Visual Anomaly Localization [9.018570847586878]
We propose PromptMAD, a cross-modal prompting framework for unsupervised visual anomaly detection and localization. Our method enriches visual reconstruction with semantic context, improving the detection of subtle and textural anomalies. Our architecture also includes a supervised segmentor that fuses multi-scale convolutional features with Transformer-based spatial attention.
arXiv Detail & Related papers (2026-01-30T03:04:06Z) - Multimodal Rumor Detection Enhanced by External Evidence and Forgery Features [21.522558828688343]
Social media increasingly disseminates information through mixed image-text posts. Deep semantic mismatch rumors pose particular challenges and threaten online public opinion. Existing multimodal rumor detection methods suffer from limited feature extraction, noisy alignment, and inflexible fusion strategies. We propose a multimodal rumor detection model enhanced with external evidence and forgery features.
arXiv Detail & Related papers (2026-01-21T12:53:18Z) - GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection [63.16754542429089]
We propose Grained Multi-modal Feature for Video Anomaly Detection (GMFVAD). We generate a finer-grained multi-modal feature based on the video snippet, which summarizes its main content. Experiments show that GMFVAD achieves state-of-the-art performance on four mainstream datasets.
arXiv Detail & Related papers (2025-10-23T06:52:53Z) - Beyond Simple Fusion: Adaptive Gated Fusion for Robust Multimodal Sentiment Analysis [27.11612547025828]
We introduce the Adaptive Gated Fusion Network (AGFN), which adaptively adjusts feature weights based on information entropy and modality importance. Experiments on CMU-MOSI and CMU-MOSEI show that AGFN significantly outperforms strong baselines in accuracy, effectively discerning subtle emotions with robust performance.
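The entropy-based weighting is stated only as an idea here; a hypothetical sketch of gating two modality features by the entropy of their unimodal predictions might look like the following (the exponential confidence mapping and two-class heads are illustrative assumptions, not the published AGFN architecture):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector (natural log)."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p))

def adaptive_gate(feat_a, probs_a, feat_b, probs_b):
    """Hypothetical adaptive gated fusion: the modality whose unimodal
    prediction is more confident (lower entropy) receives the larger
    fusion weight. Illustration of the stated idea only."""
    conf_a = np.exp(-entropy(probs_a))   # confidence in (0, 1]
    conf_b = np.exp(-entropy(probs_b))
    w_a = conf_a / (conf_a + conf_b)
    return w_a * feat_a + (1.0 - w_a) * feat_b

# A confident text head versus an uncertain audio head.
text_feat = np.ones(8)
audio_feat = np.zeros(8)
fused = adaptive_gate(text_feat, np.array([0.9, 0.1]),
                      audio_feat, np.array([0.5, 0.5]))
print(fused[0])  # > 0.5: the confident modality dominates the mixture
```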
arXiv Detail & Related papers (2025-10-02T05:05:41Z) - Diversity Boosts AI-Generated Text Detection [51.56484100374058]
DivEye is a novel framework that captures how unpredictability fluctuates across a text using surprisal-based features. Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines.
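DivEye's exact feature set is not given in this summary. As a toy illustration, assuming per-token probabilities from some language model are available, surprisal-fluctuation statistics of the kind such detectors rely on can be computed with the standard library alone:

```python
import math
import statistics

def surprisal_features(token_probs):
    """Toy diversity features over per-token probabilities: mean,
    standard deviation, and mean absolute change of surprisal
    (-log p). The intuition behind surprisal-based detectors is that
    human text tends to fluctuate more than model-generated text."""
    s = [-math.log(p) for p in token_probs]
    deltas = [abs(b - a) for a, b in zip(s, s[1:])]
    return {
        "mean_surprisal": statistics.fmean(s),
        "std_surprisal": statistics.pstdev(s),
        "mean_abs_delta": statistics.fmean(deltas),
    }

bursty = [0.9, 0.05, 0.8, 0.02, 0.7]   # fluctuating (human-like) probs
flat = [0.4, 0.38, 0.41, 0.39, 0.4]    # uniform (model-like) probs
f_bursty = surprisal_features(bursty)
f_flat = surprisal_features(flat)
print(f_bursty["std_surprisal"] > f_flat["std_surprisal"])  # True
```

A real detector would feed such features (from an actual LM's token probabilities) into a classifier; the two hard-coded probability lists here exist only to show the contrast.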
arXiv Detail & Related papers (2025-09-23T10:21:22Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection [9.145305176998447]
Weakly supervised multimodal violence detection aims to learn a violence detection model by leveraging multiple modalities.
We propose a new weakly supervised MVD method that explicitly addresses the challenges of information redundancy, modality imbalance, and modality asynchrony.
Experiments on the largest-scale XD-Violence dataset demonstrate that the proposed method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-05-08T15:27:08Z) - Text2Data: Low-Resource Data Generation with Textual Control [100.5970757736845]
Text2Data is a novel approach that utilizes unlabeled data to understand the underlying data distribution. It undergoes fine-tuning via a novel constraint-optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - A novel multimodal dynamic fusion network for disfluency detection in spoken utterances [43.79216238760557]
We propose a novel multimodal architecture for disfluency detection from individual utterances.
Our architecture leverages a multimodal dynamic fusion network that adds minimal parameters over an existing text encoder.
We show that our proposed model achieves state-of-the-art results on the widely used English Switchboard corpus for disfluency detection.
arXiv Detail & Related papers (2022-11-27T01:54:22Z) - Multi-Fact Correction in Abstractive Text Summarization [98.27031108197944]
Span-Fact is a suite of two factual correction models that leverages knowledge learned from question answering models to make corrections in system-generated summaries via span selection.
Our models employ single or multi-masking strategies to either iteratively or auto-regressively replace entities in order to ensure semantic consistency w.r.t. the source text.
Experiments show that our models significantly boost the factual consistency of system-generated summaries without sacrificing summary quality in terms of both automatic metrics and human evaluation.
arXiv Detail & Related papers (2020-10-06T02:51:02Z)
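Span-Fact's span selection is performed by trained question-answering models. As a very rough stand-in for that step, the sketch below masks a named entity in a summary and replaces it with an entity drawn from the source document; the frequency heuristic and the regex-based entity finder are purely illustrative substitutes for the QA model:

```python
import re
from collections import Counter

def correct_entity(summary, source, entity):
    """Toy span correction: mask one entity in the summary and fill the
    mask with the most frequent capitalized token from the source.
    Stand-in for Span-Fact's QA-based span selection; the frequency
    heuristic is illustrative only."""
    masked = summary.replace(entity, "[MASK]")
    candidates = Counter(re.findall(r"\b[A-Z][a-z]+\b", source))
    best, _count = candidates.most_common(1)[0]
    return masked.replace("[MASK]", best)

source = "Paris hosted the summit. Paris officials confirmed the deal."
summary = "London hosted the summit."
result = correct_entity(summary, source, "London")
print(result)  # Paris hosted the summit.
```

The real system replaces entities iteratively or auto-regressively under single- or multi-masking, so that each filled span is conditioned on the previously corrected ones.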
This list is automatically generated from the titles and abstracts of the papers on this site.