VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video
Anomaly Detection
- URL: http://arxiv.org/abs/2308.11681v3
- Date: Fri, 15 Dec 2023 09:42:25 GMT
- Title: VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video
Anomaly Detection
- Authors: Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng
Wang, Yanning Zhang
- Abstract summary: We propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD).
VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP.
We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD.
- Score: 58.47940430618352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent contrastive language-image pre-training (CLIP) model has shown
great success in a wide range of image-level tasks, revealing remarkable
ability for learning powerful visual representations with rich semantics. An
open and worthwhile problem is efficiently adapting such a strong model to the
video domain and designing a robust video anomaly detector. In this work, we
propose VadCLIP, a new paradigm for weakly supervised video anomaly detection
(WSVAD) that leverages the frozen CLIP model directly, without any additional
pre-training or fine-tuning. Unlike current works that directly feed extracted
features into a weakly supervised classifier for frame-level binary
classification, VadCLIP makes full use of the fine-grained associations between
vision and language on the strength of CLIP and adopts a dual-branch design. One
branch simply utilizes visual features for coarse-grained binary
classification, while the other fully leverages the fine-grained language-image
alignment. With the benefit of this dual-branch design, VadCLIP achieves both
coarse-grained and fine-grained video anomaly detection by transferring
pre-trained knowledge from CLIP to the WSVAD task. We conduct extensive experiments
on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best
performance on both coarse-grained and fine-grained WSVAD, surpassing the
state-of-the-art methods by a large margin. Specifically, VadCLIP achieves
84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime, respectively. Code and
features are released at https://github.com/nwpu-zxr/VadCLIP.
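A minimal sketch of the dual-branch idea described in the abstract, assuming frame features have already been extracted with a frozen CLIP image encoder and class prompts encoded with the CLIP text encoder; the GRU temporal model, layer sizes, and names below are illustrative placeholders, not the authors' exact architecture.

```python
# Hypothetical sketch: a dual-branch head on top of frozen CLIP features for WSVAD.
# frame_feats (B x T x D) are assumed to come from CLIP's image encoder and
# text_feats (C x D) from the CLIP text encoder for C class prompts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchWSVAD(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Lightweight temporal modelling over frame features (placeholder choice).
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        # Branch 1: coarse-grained, frame-level binary anomaly scores.
        self.binary_head = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))
        # Learnable temperature for the alignment branch, initialized as in CLIP.
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

    def forward(self, frame_feats, text_feats):
        h, _ = self.temporal(frame_feats)                         # (B, T, D)
        coarse = torch.sigmoid(self.binary_head(h)).squeeze(-1)   # (B, T) anomaly prob.
        # Branch 2: fine-grained frame-to-class alignment via cosine similarity.
        v = F.normalize(h, dim=-1)
        t = F.normalize(text_feats, dim=-1)
        fine = self.logit_scale.exp() * v @ t.t()                 # (B, T, C) class logits
        return coarse, fine

# Usage with random tensors standing in for precomputed CLIP features.
model = DualBranchWSVAD()
coarse, fine = model(torch.randn(2, 64, 512), torch.randn(14, 512))
print(coarse.shape, fine.shape)  # torch.Size([2, 64]) torch.Size([2, 64, 14])
```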
Related papers
- Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia [45.93202559299953]
This paper introduces an alternative way for CLIP adaptation without adding 'external' parameters to optimize.
We find that simply fine-tuning the last projection matrix of the vision encoder leads to strong performance compared to the existing baselines.
Perhaps surprisingly, this approach, coined ProLIP, yields performance on par with or better than the state of the art on 11 few-shot classification benchmarks; a minimal sketch of the idea appears after this list.
arXiv Detail & Related papers (2024-10-07T17:59:59Z)
- Diffusion Feedback Helps CLIP See Better [40.125318318373715]
Contrastive Language-Image Pre-training (CLIP) excels at abstracting open-world representations across domains and modalities.
However, CLIP has severe visual shortcomings: for example, it can hardly distinguish orientation, quantity, color, and structure.
We present a post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process.
arXiv Detail & Related papers (2024-07-29T17:00:09Z)
- CLIP-guided Prototype Modulating for Few-shot Action Recognition [49.11385095278407]
This work aims to transfer the powerful multimodal knowledge of CLIP to alleviate the issue of inaccurate prototype estimation in few-shot action recognition.
We present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of a video-text contrastive objective and a prototype modulation.
arXiv Detail & Related papers (2023-03-06T09:17:47Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- PointCLIP: Point Cloud Understanding by CLIP [77.02399444893963]
We propose PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts.
PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP at a low resource cost and in a low-data regime.
arXiv Detail & Related papers (2021-12-04T19:42:40Z)
- How Much Can CLIP Benefit Vision-and-Language Tasks? [121.46042421728016]
We show that CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has strong zero-shot capability on various vision tasks.
We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
arXiv Detail & Related papers (2021-07-13T20:48:12Z)
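As referenced in the ProLIP entry above, the following is a minimal sketch of fine-tuning only CLIP's last visual projection matrix. It assumes the OpenAI `clip` package, whose ViT models expose that projection as the parameter `model.visual.proj`; the training loop, `few_shot_loader`, and `class_prompts` are hypothetical placeholders, not details taken from the ProLIP paper.

```python
# Hypothetical sketch of ProLIP-style adaptation: fine-tune only CLIP's last visual
# projection matrix, keeping every other weight frozen.
import clip
import torch
import torch.nn.functional as F

model, preprocess = clip.load("ViT-B/16")
for p in model.parameters():
    p.requires_grad_(False)                 # freeze the entire model ...
model.visual.proj.requires_grad_(True)      # ... except the last visual projector

optimizer = torch.optim.AdamW([model.visual.proj], lr=1e-4)

# Few-shot adaptation loop; few_shot_loader and class_prompts are placeholders.
# for images, labels in few_shot_loader:
#     image_feats = F.normalize(model.encode_image(images), dim=-1)
#     text_feats = F.normalize(model.encode_text(clip.tokenize(class_prompts)), dim=-1)
#     logits = model.logit_scale.exp() * image_feats @ text_feats.t()
#     loss = F.cross_entropy(logits, labels)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```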