CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly
Detection
- URL: http://arxiv.org/abs/2311.00453v2
- Date: Sat, 2 Mar 2024 13:54:31 GMT
- Title: CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly
Detection
- Authors: Xuhai Chen, Jiangning Zhang, Guanzhong Tian, Haoyang He, Wuhao Zhang,
Yabiao Wang, Chengjie Wang, Yong Liu
- Abstract summary: We propose a framework called CLIP-AD to leverage the zero-shot capabilities of the large vision-language model CLIP.
We note opposite predictions and irrelevant highlights in the direct computation of the anomaly maps.
- Score: 49.510604614688745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper considers zero-shot Anomaly Detection (AD), performing AD without
reference images of the test objects. We propose a framework called CLIP-AD to
leverage the zero-shot capabilities of the large vision-language model CLIP.
Firstly, we reinterpret the text prompts design from a distributional
perspective and propose a Representative Vector Selection (RVS) paradigm to
obtain improved text features. Secondly, we note opposite predictions and
irrelevant highlights in the direct computation of the anomaly maps. To address
these issues, we introduce a Staged Dual-Path model (SDP) that leverages
features from various levels and applies architecture and feature surgery.
Lastly, delving deeply into the two phenomena, we point out that the image and
text features are not aligned in the joint embedding space. Thus, we introduce
a fine-tuning strategy by adding linear layers and construct an extended model
SDP+, further enhancing the performance. Abundant experiments demonstrate the
effectiveness of our approach, e.g., on MVTec-AD, SDP outperforms the SOTA
WinCLIP by +4.2/+10.7 in segmentation metrics F1-max/PRO, while SDP+ achieves
+8.3/+20.5 improvements.
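To make the "direct computation" of anomaly maps mentioned above concrete, the sketch below scores image patches against normal/abnormal text prompts with an off-the-shelf CLIP. It is a minimal illustration, not the authors' SDP/SDP+ code: the prompt wording, the open_clip ViT-B-16 backbone, and the helper names are assumptions, RVS is approximated by a plain ensemble mean, and the multi-level feature extraction and surgery of SDP (as well as SDP+'s linear alignment layers) are omitted.

```python
# Minimal sketch of direct zero-shot anomaly-map computation with CLIP.
# Assumptions: open_clip ViT-B-16 weights, generic prompt wording,
# pre-extracted patch tokens already projected into the joint space.
import torch
import torch.nn.functional as F
import open_clip  # pip install open_clip_torch

model, _, _ = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

normal_prompts = ["a photo of a flawless object", "a photo of a perfect object"]
abnormal_prompts = ["a photo of a damaged object", "a photo of an object with a defect"]

@torch.no_grad()
def text_feature(prompts):
    """Encode a prompt ensemble and return one representative vector.
    Plain averaging stands in here for the paper's Representative Vector Selection (RVS)."""
    feats = model.encode_text(tokenizer(prompts)).float()
    feats = F.normalize(feats, dim=-1)
    return F.normalize(feats.mean(dim=0), dim=-1)

def anomaly_map(patch_tokens, t_normal, t_abnormal, grid_hw, temperature=0.07):
    """patch_tokens: [N, D] patch features assumed to live in CLIP's joint embedding
    space -- obtaining well-aligned patch features is exactly what SDP/SDP+ address,
    and that part is not shown here."""
    patch_tokens = F.normalize(patch_tokens.float(), dim=-1)
    text_matrix = torch.stack([t_normal, t_abnormal], dim=1)            # [D, 2]
    probs = (patch_tokens @ text_matrix / temperature).softmax(dim=-1)  # [N, 2]
    return probs[:, 1].reshape(grid_hw)  # per-patch probability of "abnormal"

t_n = text_feature(normal_prompts)
t_a = text_feature(abnormal_prompts)
# patch_tokens = ...  # e.g. a [196, 512] tensor for a 14x14 ViT-B/16 grid
# amap = anomaly_map(patch_tokens, t_n, t_a, (14, 14))
```

As the abstract notes, this naive map can show opposite predictions and irrelevant highlights; the staged dual-path design and the added linear layers of SDP+ are introduced precisely to correct that.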
Related papers
- PIP: Perturbation-based Iterative Pruning for Large Language Models [5.511065308044068]
We propose PIP (Perturbation-based Iterative Pruning), a novel double-view structured pruning method to optimize Large Language Models.
Our experiments show that PIP reduces the parameter count by approximately 20% while retaining over 85% of the original model's accuracy.
arXiv Detail & Related papers (2025-01-25T17:10:50Z)
- Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia [45.93202559299953]
This paper introduces an alternative way for CLIP adaptation without adding 'external' parameters to optimize.
We find that simply fine-tuning the last projection matrix of the vision encoder leads to better performance than all baselines.
This simple approach, coined ProLIP, yields state-of-the-art performance on 11 few-shot classification benchmarks.
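A minimal sketch of this recipe follows, assuming a ViT-based CLIP that exposes the final visual projection as `model.visual.proj` (as in the OpenAI and open_clip implementations; the attribute name may differ in other codebases):

```python
# Hedged sketch of the ProLIP idea summarized above: train only CLIP's last
# visual projection matrix and freeze everything else.
import torch
import open_clip  # pip install open_clip_torch

model, _, _ = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
for p in model.parameters():
    p.requires_grad_(False)             # freeze the whole model...
model.visual.proj.requires_grad_(True)  # ...except the last visual projection matrix

optimizer = torch.optim.AdamW([model.visual.proj], lr=1e-4)
# The few-shot training loop (e.g. cross-entropy against class-name text features)
# is omitted; only this single matrix would receive gradient updates.
```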
arXiv Detail & Related papers (2024-10-07T17:59:59Z)
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images.
We propose to recognize human attributes from video frames so that temporal information can be fully exploited.
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
- DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [111.68263493302499]
We introduce DetCLIPv3, a high-performing detector that excels at both open-vocabulary object detection and generating hierarchical labels for detected objects.
DetCLIPv3 is characterized by three core designs: 1) Versatile model architecture; 2) High information density data; and 3) Efficient training strategy.
DetCLIPv3 demonstrates superior open-vocabulary detection performance, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively.
arXiv Detail & Related papers (2024-04-14T11:01:44Z)
- Sample Complexity Characterization for Linear Contextual MDPs [67.79455646673762]
Contextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time, with different MDPs indexed by a context variable.
CMDPs serve as an important framework to model many real-world applications with time-varying environments.
We study CMDPs under two linear function approximation models: Model I with context-varying representations and common linear weights for all contexts; and Model II with common representations for all contexts and context-varying linear weights.
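Under the standard linear MDP parameterization, the two models can plausibly be written as follows (the notation is ours, not necessarily the paper's):

```latex
% Model I: context-varying representation \phi_c, common linear weights
P_c(s' \mid s, a) = \langle \phi_c(s, a), \mu(s') \rangle, \qquad
r_c(s, a) = \langle \phi_c(s, a), \theta \rangle
% Model II: common representation \phi, context-varying linear weights
P_c(s' \mid s, a) = \langle \phi(s, a), \mu_c(s') \rangle, \qquad
r_c(s, a) = \langle \phi(s, a), \theta_c \rangle
```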
arXiv Detail & Related papers (2024-02-05T03:25:04Z)
- Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion [23.62010759076202]
We formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels.
Our proposed PAR algorithm adjusts only 0.75% of the learnable parameters compared with the full fine-tuning strategy.
arXiv Detail & Related papers (2023-12-17T11:59:14Z)
- VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection [58.47940430618352]
We propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD).
VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP.
We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD.
arXiv Detail & Related papers (2023-08-22T14:58:36Z)
- Adapting Contrastive Language-Image Pretrained (CLIP) Models for Out-of-Distribution Detection [1.597617022056624]
We present a comprehensive experimental study on pretrained feature extractors for visual out-of-distribution (OOD) detection.
We propose a new simple and scalable method called pseudo-label probing (PLP) that adapts vision-language models for OOD detection.
arXiv Detail & Related papers (2023-03-10T10:02:18Z)
- SPTS v2: Single-Point Scene Text Spotting [146.98118405786445]
The new framework, SPTS v2, allows high-performing text-spotting models to be trained using a single-point annotation.
Tests show SPTS v2 can outperform previous state-of-the-art single-point text spotters with fewer parameters.
Experiments suggest a potential preference for single-point representation in scene text spotting.
arXiv Detail & Related papers (2023-01-04T14:20:14Z)
- CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention [31.84299688413136]
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual representations with great transferability.
Existing works propose additional learnable modules upon CLIP and fine-tune them by few-shot training sets.
We introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free Attention module.
arXiv Detail & Related papers (2022-09-28T15:22:11Z)