WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation
- URL: http://arxiv.org/abs/2303.14814v1
- Date: Sun, 26 Mar 2023 20:41:21 GMT
- Title: WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation
- Authors: Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash
Ravichandran, Onkar Dabeer
- Abstract summary: We address zero-shot and few-normal-shot anomaly classification and segmentation.
We propose window-based CLIP (WinCLIP) with a compositional ensemble on state words and prompt templates.
We also propose its few-normal-shot extension WinCLIP+, which uses complementary information from normal images.
- Score: 26.405789621523137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual anomaly classification and segmentation are vital for automating
industrial quality inspection. The focus of prior research in the field has
been on training custom models for each quality inspection task, which requires
task-specific images and annotation. In this paper we move away from this
regime, addressing zero-shot and few-normal-shot anomaly classification and
segmentation. Recently CLIP, a vision-language model, has shown revolutionary
generality with competitive zero-/few-shot performance in comparison to
full-supervision. But CLIP falls short on anomaly classification and
segmentation tasks. Hence, we propose window-based CLIP (WinCLIP) with (1) a
compositional ensemble on state words and prompt templates and (2) efficient
extraction and aggregation of window/patch/image-level features aligned with
text. We also propose its few-normal-shot extension WinCLIP+, which uses
complementary information from normal images. In MVTec-AD (and VisA), without
further tuning, WinCLIP achieves 91.8%/85.1% (78.1%/79.6%) AUROC in zero-shot
anomaly classification and segmentation while WinCLIP+ does 93.1%/95.2%
(83.8%/96.4%) in 1-normal-shot, surpassing state-of-the-art by large margins.
Related papers
- Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation [19.749490092520006]
Self-Calibrated CLIP (SC-CLIP) is a training-free method that calibrates CLIP to produce finer-language representations.
SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times.
arXiv Detail & Related papers (2024-11-24T15:14:05Z) - Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation(CODER) based on the distance structure between images and their neighbor texts.
The key to construct a high-quality CODER lies in how to create a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z) - Transductive Zero-Shot and Few-Shot CLIP [24.592841797020203]
This paper addresses the transductive zero-shot and few-shot CLIP classification challenge.
Inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently.
Our approach yields near 20% improvement in ImageNet accuracy over CLIP's zero-shot performance.
arXiv Detail & Related papers (2024-04-08T12:44:31Z) - Investigating the Limitation of CLIP Models: The Worst-Performing
Categories [53.360239882501325]
Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts.
It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts.
However, we found that their performance in the worst categories is significantly inferior to the overall performance.
arXiv Detail & Related papers (2023-10-05T05:37:33Z) - Zero-Shot Visual Classification with Guided Cropping [9.321383320998262]
We propose an off-the-shelf zero-shot object detection model in a preprocessing step to increase focus of zero-shot classifier to the object of interest.
We empirically show that our approach improves zero-shot classification results across architectures and datasets, favorably for small objects.
arXiv Detail & Related papers (2023-09-12T20:09:12Z) - CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1
Accuracy with ViT-B and ViT-L on ImageNet [139.56863124214905]
We find that fine-tuning performance of CLIP is substantially underestimated.
Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 can achieve 85.7%,88.0% finetuning Top-1 accuracy on the ImageNet-1K dataset.
arXiv Detail & Related papers (2022-12-12T18:59:59Z) - ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation [35.60888272729273]
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme.
While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost.
We propose a simpler-and-efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from image to pixel level.
arXiv Detail & Related papers (2022-12-07T12:05:00Z) - Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complimentary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z) - CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks [85.37552507367175]
Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space.
We propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures.
arXiv Detail & Related papers (2022-01-15T01:54:01Z) - A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained
Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition working well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target for zero-shot semantic segmentation, by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-arts by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.