Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection
- URL: http://arxiv.org/abs/2404.08531v1
- Date: Fri, 12 Apr 2024 15:18:25 GMT
- Title: Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection
- Authors: Zhiwei Yang, Jing Liu, Peng Wu
- Abstract summary: We propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance for WSVAD.
Our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Violence.
- Score: 10.269746485037935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised video anomaly detection (WSVAD) is a challenging task. Generating fine-grained pseudo-labels from weak video-level labels and then self-training a classifier is currently a promising solution. However, existing methods use only the RGB visual modality and neglect category text information, which limits the generation of accurate pseudo-labels and hurts the performance of self-training. Inspired by the manual labeling process, which relies on event descriptions, we propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer the rich language-visual knowledge of the contrastive language-image pre-training (CLIP) model to align video event description texts with the corresponding video frames and thereby generate pseudo-labels. Specifically, we first fine-tune CLIP for domain adaptation by designing two ranking losses and a distributional inconsistency loss. Further, we propose a learnable text prompt mechanism, assisted by a normality visual prompt, to further improve the matching accuracy between video event description texts and video frames. Then, we design a pseudo-label generation module based on normality guidance to infer reliable frame-level pseudo-labels. Finally, we introduce a temporal context self-adaptive learning module to learn the temporal dependencies of different video events more flexibly and accurately. Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Violence.
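The TPWNG implementation is not reproduced here, but the core CLIP alignment idea can be illustrated with a minimal sketch. Everything below is an assumption-labeled toy: the `event_texts` descriptions, the softmax scoring, and the `frame_pseudo_labels` helper are invented for illustration, while the real method adds ranking losses, learnable prompts, normality guidance, and temporal context modeling on top of this matching step.

```python
# Minimal sketch (not the authors' TPWNG code): align video frames with event
# description texts via CLIP and read off frame-level anomaly pseudo-labels.
import torch
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical event descriptions; index 0 plays the role of the normal class.
event_texts = ["a normal scene with regular activity",
               "a fighting event", "a road accident", "an explosion"]
text_tokens = clip.tokenize(event_texts).to(device)

@torch.no_grad()
def frame_pseudo_labels(frames):
    """frames: list of PIL.Image video frames -> per-frame anomaly score in [0, 1]."""
    images = torch.stack([preprocess(f) for f in frames]).to(device)
    img_feat = model.encode_image(images)
    txt_feat = model.encode_text(text_tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)  # frame-text matching
    return 1.0 - probs[:, 0]  # anomaly pseudo-label = 1 - normality probability
```

In the paper these raw similarities are only a starting point; the normality guidance and temporal context modules refine them before the classifier is self-trained.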
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training (see the sketch after this entry).
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
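As a concrete (and purely illustrative) rendering of this protocol, the sketch below captions sampled frames with an off-the-shelf captioner; BLIP via Hugging Face transformers is an assumption, not a model the paper prescribes.

```python
# Illustrative sketch: generate pseudo text labels for video frames with an
# off-the-shelf image captioner (BLIP is assumed here, not prescribed).
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

def pseudo_caption(frame):
    """frame: PIL.Image -> generated caption used as a pseudo text label."""
    inputs = processor(images=frame, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# (pseudo_caption(frame), video) pairs can then supervise a standard
# text-to-video retrieval objective, e.g. a symmetric contrastive loss.
```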
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- Videoprompter: an ensemble of foundational models for zero-shot video understanding [113.92958148574228]
Vision-language models (VLMs) classify a query video by calculating a similarity score between its visual features and text-based class label representations.
We propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models. A minimal sketch of the underlying similarity-scoring baseline appears after this entry.
arXiv Detail & Related papers (2023-10-23T19:45:46Z)
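The plain similarity-scoring baseline that Videoprompter augments can be sketched as follows; this is a generic zero-shot CLIP classifier under assumed names (`classify_video`, the "a video of ..." prompt template), not the paper's ensemble.

```python
# Sketch of the plain VLM baseline: zero-shot video classification by comparing
# a mean-pooled video embedding against text embeddings of the class labels.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def classify_video(frames, class_names):
    """frames: list of PIL.Image; class_names: list of str -> (best class, scores)."""
    images = torch.stack([preprocess(f) for f in frames]).to(device)
    video_feat = model.encode_image(images).mean(dim=0, keepdim=True)  # temporal pool
    text = clip.tokenize([f"a video of {c}" for c in class_names]).to(device)
    text_feat = model.encode_text(text)
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (video_feat @ text_feat.T).squeeze(0)  # cosine similarity per class
    return class_names[scores.argmax().item()], scores
```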
- Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event Localization [0.0]
AVEL is the task of temporally localizing and classifying audio-visual events, i.e., events simultaneously visible and audible in a video.
In this paper, we solve AVEL in a weakly-supervised setting, where only video-level event labels are available as supervision for training.
Our idea is to use a base model to estimate labels on the training data at a finer temporal resolution than the video level and to re-train the model with these labels (see the sketch after this entry).
arXiv Detail & Related papers (2023-07-12T18:13:58Z)
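The refinement loop described in this entry can be sketched as follows; `base_model`, the video-level masking, and the 0.5 threshold are illustrative assumptions rather than the authors' exact procedure.

```python
# Illustrative sketch of temporal label refinement: a base model's segment-level
# predictions, masked by the known video-level labels, become finer-grained
# pseudo-labels for re-training.
import torch

@torch.no_grad()
def refine_labels(base_model, video_feats, video_label, threshold=0.5):
    """video_feats: (T, D) segment features; video_label: (C,) multi-hot video tag.
    Returns (T, C) segment-level pseudo-labels."""
    seg_probs = torch.sigmoid(base_model(video_feats))  # (T, C) per-segment scores
    seg_probs = seg_probs * video_label                 # zero out events absent at video level
    return (seg_probs > threshold).float()

# Re-training then treats these segment-level pseudo-labels as supervision,
# e.g. with a binary cross-entropy loss against new segment predictions.
```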
- Query-based Video Summarization with Pseudo Label Supervision [19.229722872058055]
Existing datasets for manually labelled query-based video summarization are costly and thus small.
Self-supervision can address the data sparsity challenge by using a pretext task and defining a method to acquire extra data with pseudo labels.
Experimental results show that the proposed video summarization algorithm achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-04T22:28:17Z)
- Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos [39.42509966219001]
This paper studies weakly supervised sequential video understanding where the accurate time-level text-video alignment is not provided.
We use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video.
Experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin.
arXiv Detail & Related papers (2023-03-22T08:13:25Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.