VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation
- URL: http://arxiv.org/abs/2407.12276v1
- Date: Wed, 17 Jul 2024 02:54:41 GMT
- Title: VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation
- Authors: Zhen Qu, Xian Tao, Mukesh Prasad, Fei Shen, Zhengtao Zhang, Xinyi Gong, Guiguang Ding
- Abstract summary: We propose a visual context prompting model (VCP-CLIP) for ZSAS task based on CLIP.
Specifically, we first design a Pre-VCP module to embed global visual information into the text prompt.
Then, we propose a novel Post-VCP module that adjusts the text embeddings using the fine-grained features of the images.
- Score: 19.83954061346437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large-scale vision-language models such as CLIP have demonstrated immense potential in the zero-shot anomaly segmentation (ZSAS) task, using a unified model to directly detect anomalies on any unseen product with painstakingly crafted text prompts. However, existing methods often assume that the product category to be inspected is known, and thus set product-specific text prompts, which is difficult to achieve in data-privacy scenarios. Moreover, even products of the same type exhibit significant differences due to specific components and variations in the production process, posing significant challenges to the design of text prompts. To this end, we propose a visual context prompting model (VCP-CLIP) for the ZSAS task based on CLIP. The insight behind VCP-CLIP is to employ visual context prompting to activate CLIP's anomalous semantic perception ability. Specifically, we first design a Pre-VCP module to embed global visual information into the text prompt, thus eliminating the need for product-specific prompts. We then propose a novel Post-VCP module that adjusts the text embeddings using the fine-grained features of the images. In extensive experiments on 10 real-world industrial anomaly segmentation datasets, VCP-CLIP achieved state-of-the-art performance on the ZSAS task. The code is available at https://github.com/xiaozhen228/VCP-CLIP.
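The two modules described in the abstract can be sketched at a high level. The NumPy toy below is a minimal illustration under stated assumptions, not the paper's implementation: the dimensions, random weights, and scoring rule are all hypothetical. It shows the idea of Pre-VCP (folding a global image feature into the text prompts, so no product name is needed) and Post-VCP (adjusting the text embeddings via cross-attention over patch-level image features).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 64                       # shared embedding dimension (illustrative)
num_patches, num_prompts = 196, 2

# Pre-VCP idea: embed a global visual feature into the text prompts,
# replacing any product-specific wording with image-derived context.
global_feat = rng.standard_normal(d)            # global image embedding
text_prompts = rng.standard_normal((num_prompts, d))  # e.g. "normal"/"anomalous"
W_pre = rng.standard_normal((d, d)) * 0.05      # hypothetical projection
prompts = text_prompts + global_feat @ W_pre    # product-agnostic prompts

# Post-VCP idea: adjust the text embeddings with fine-grained (patch)
# features via cross-attention: prompts attend to image patches.
patch_feats = rng.standard_normal((num_patches, d))
attn = softmax(prompts @ patch_feats.T / np.sqrt(d))  # (prompts, patches)
adjusted = prompts + attn @ patch_feats               # image-conditioned text embeddings

# Per-patch anomaly score: similarity of each patch to the "anomalous"
# prompt, normalized against the "normal" prompt.
scores = softmax(l2norm(patch_feats) @ l2norm(adjusted).T, axis=-1)[:, 1]
print(scores.shape)  # one anomaly probability per patch
```

Reshaping `scores` to the patch grid (e.g. 14 x 14 for 196 patches) and upsampling would yield a coarse anomaly map, which is the general shape of a ZSAS output.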
Related papers
- IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain [40.584137588388245]
IQE-CLIP is an innovative framework for anomaly detection tasks in the medical domain.
We introduce class-based prompting tokens and learnable prompting tokens for better adaptation of CLIP to the medical domain.
Our framework achieves state-of-the-art performance on both zero-shot and few-shot tasks.
arXiv Detail & Related papers (2025-06-12T14:23:06Z) - ConText: Driving In-context Learning for Text Removal and Segmentation [59.6299939669307]
This paper presents the first study on adapting the visual in-context learning paradigm to optical character recognition tasks.
We propose a task-chaining compositor in the form of image-removal-segmentation.
We also introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation.
arXiv Detail & Related papers (2025-06-04T10:06:32Z) - ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection [2.622385361961154]
Zero-shot anomaly detection (ZSAD) aims to detect anomalies without any target-domain training samples, relying solely on external auxiliary data.
Existing CLIP-based methods attempt to activate the model's ZSAD potential via handcrafted or static learnable prompts.
ViP$^2$-CLIP fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts.
arXiv Detail & Related papers (2025-05-23T10:01:11Z) - KAnoCLIP: Zero-Shot Anomaly Detection through Knowledge-Driven Prompt Learning and Enhanced Cross-Modal Integration [9.688664292809785]
Zero-shot anomaly detection (ZSAD) identifies anomalies without needing training samples from the target dataset.
Vision-language models like CLIP show potential in ZSAD but have limitations.
We introduce KAnoCLIP, a novel ZSAD framework that leverages vision-language models.
KAnoCLIP achieves state-of-the-art performance in ZSAD across 12 industrial and medical datasets.
arXiv Detail & Related papers (2025-01-07T13:51:41Z) - CLIP-FSAC++: Few-Shot Anomaly Classification with Anomaly Descriptor Based on CLIP [22.850815902535988]
We propose an effective few-shot anomaly classification framework with one-stage training, dubbed CLIP-FSAC++.
In the anomaly descriptor, an image-to-text cross-attention module is used to obtain image-specific text embeddings.
Comprehensive experimental results are provided for evaluating our method on few-normal-shot anomaly classification on VisA and MVTec-AD under 1, 2, 4, and 8-shot settings.
arXiv Detail & Related papers (2024-12-05T02:44:45Z) - GlocalCLIP: Object-agnostic Global-Local Prompt Learning for Zero-shot Anomaly Detection [5.530212768657544]
We introduce glocal contrastive learning to improve the learning of global and local prompts, effectively detecting abnormal patterns across various domains.
The generalization performance of GlocalCLIP in ZSAD was demonstrated on 15 real-world datasets from both the industrial and medical domains.
arXiv Detail & Related papers (2024-11-09T05:22:13Z) - DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks [31.850184662606562]
We introduce DetailCLIP, a detail-oriented CLIP framework, to address the limitations of contrastive learning-based vision-language models.
We show that DetailCLIP surpasses existing CLIP-based and traditional self-supervised learning (SSL) models in segmentation accuracy and exhibits superior generalization across diverse datasets.
arXiv Detail & Related papers (2024-09-10T18:27:36Z) - Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation [72.47110803885235]
We introduce a novel framework named Cascade-CLIP for zero-shot semantic segmentation.
Our framework achieves superior zero-shot performance on segmentation benchmarks like COCO-Stuff, Pascal-VOC, and Pascal-Context.
arXiv Detail & Related papers (2024-06-02T08:32:51Z) - Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - TF-CLIP: Learning Text-free CLIP for Video-based Person Re-Identification [60.5843635938469]
We propose a novel one-stage text-free CLIP-based learning framework named TF-CLIP for video-based person ReID.
More specifically, we extract the identity-specific sequence feature as the CLIP-Memory to replace the text feature.
Our proposed method shows much better results than other state-of-the-art methods on MARS, LS-VID and iLIDS-VID.
arXiv Detail & Related papers (2023-12-15T09:10:05Z) - Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding that enables multi-modal output.
First, we quantize and embed both text and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z) - Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization [63.61093388441298]
Contrastive Language-Image Pre-training models have shown promising performance on zero-shot visual recognition tasks.
In this work, we propose AnoCLIP for zero-shot anomaly localization.
arXiv Detail & Related papers (2023-08-30T10:35:36Z) - CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.
We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 11 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z) - WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation [26.405789621523137]
We address zero-shot and few-normal-shot anomaly classification and segmentation.
We propose window-based CLIP (WinCLIP) with a compositional ensemble on state words and prompt templates.
We also propose its few-normal-shot extension WinCLIP+, which uses complementary information from normal images.
arXiv Detail & Related papers (2023-03-26T20:41:21Z)
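WinCLIP's compositional ensemble pairs state words with prompt templates, which makes for a simple concrete sketch. The word lists and templates below are illustrative placeholders, not the paper's exact ones:

```python
# Compositional prompt ensemble in the spirit of WinCLIP: every state word
# is crossed with every template for a given product name.  All strings
# here are illustrative assumptions, not WinCLIP's actual prompt sets.
normal_states = ["flawless", "perfect", "without defect"]
anomalous_states = ["damaged", "broken", "with a defect"]
templates = [
    "a photo of a {} {}",
    "a cropped photo of a {} {}",
    "a close-up photo of a {} {}",
]

def build_prompts(states, product, templates):
    """Cross every state word with every template for one product."""
    return [t.format(s, product) for t in templates for s in states]

normal_prompts = build_prompts(normal_states, "transistor", templates)
anomalous_prompts = build_prompts(anomalous_states, "transistor", templates)
print(len(normal_prompts), len(anomalous_prompts))  # 9 9
```

In a CLIP-based pipeline, each set of prompts would be encoded with the text encoder and averaged into a single "normal" and a single "anomalous" embedding; the ensemble averages out the sensitivity of any one handcrafted phrasing.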
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.