Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot
Anomaly Localization
- URL: http://arxiv.org/abs/2308.15939v2
- Date: Tue, 27 Feb 2024 00:07:47 GMT
- Title: Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot
Anomaly Localization
- Authors: Hanqiu Deng, Zhaoxiang Zhang, Jinan Bao, Xingyu Li
- Abstract summary: Contrastive Language-Image Pre-training models have shown promising performance on zero-shot visual recognition tasks.
In this work, we propose AnoCLIP for zero-shot anomaly localization.
- Score: 63.61093388441298
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) models have shown promising
performance on zero-shot visual recognition tasks by learning visual
representations under natural language supervision. Recent studies attempt the
use of CLIP to tackle zero-shot anomaly detection by matching images with
normal and abnormal state prompts. However, since CLIP focuses on building
correspondence between paired text prompts and global image-level
representations, the lack of fine-grained patch-level vision to text alignment
limits its capability on precise visual anomaly localization. In this work, we
propose AnoCLIP for zero-shot anomaly localization. In the visual encoder, we
introduce a training-free value-wise attention mechanism to extract intrinsic
local tokens of CLIP for patch-level local description. From the perspective of
text supervision, we particularly design a unified domain-aware contrastive
state prompting template for fine-grained vision-language matching. On top of
the proposed AnoCLIP, we further introduce a test-time adaptation (TTA)
mechanism to refine visual anomaly localization results, where we optimize a
lightweight adapter in the visual encoder using AnoCLIP's pseudo-labels and
noise-corrupted tokens. With both AnoCLIP and TTA, we significantly exploit the
potential of CLIP for zero-shot anomaly localization and demonstrate the
effectiveness of AnoCLIP on various datasets.
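The abstract describes a concrete inference recipe: contrastive normal/abnormal state prompts are embedded with CLIP's text encoder and matched against patch-level tokens from the frozen visual encoder to produce a localization map. The sketch below illustrates only that patch-to-text matching step. It is a minimal sketch assuming the open_clip package with a ViT-B/16 backbone; the prompt templates, the choice of the final transformer block for patch tokens, and all helper names are illustrative assumptions rather than the paper's exact design, and the value-wise attention and TTA adapter are not reproduced here.
```python
# Minimal sketch of zero-shot patch-level anomaly localization with CLIP:
# contrastive normal/abnormal state prompts are matched against patch tokens
# of a frozen ViT, in the spirit of (but not identical to) AnoCLIP.
import torch
import torch.nn.functional as F
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model = model.to(device).eval()

# Hypothetical contrastive state templates; the paper's unified domain-aware
# template is more elaborate than this.
NORMAL_PROMPTS = ["a photo of a flawless {}", "a photo of a normal {}"]
ABNORMAL_PROMPTS = ["a photo of a damaged {}", "a photo of a {} with a defect"]

def text_embedding(templates, obj):
    """Average the embeddings of a prompt ensemble and re-normalize."""
    tokens = tokenizer([t.format(obj) for t in templates]).to(device)
    with torch.no_grad():
        emb = F.normalize(model.encode_text(tokens), dim=-1)
    return F.normalize(emb.mean(dim=0), dim=-1)

def patch_tokens(image):
    """Patch tokens from the last visual block, projected into the joint space.

    Assumes open_clip's (seq_len, batch, dim) token layout inside the visual transformer.
    """
    grabbed = {}
    handle = model.visual.transformer.resblocks[-1].register_forward_hook(
        lambda module, inputs, output: grabbed.update(tokens=output))
    with torch.no_grad():
        model.encode_image(image)
        handle.remove()
        tok = grabbed["tokens"].permute(1, 0, 2)             # (L, N, D) -> (N, L, D)
        tok = model.visual.ln_post(tok) @ model.visual.proj  # project to the text space
    return F.normalize(tok[:, 1:, :], dim=-1)                # drop the CLS token

def anomaly_map(image_tensor, obj="object"):
    """Per-patch probability of the 'abnormal' state, as a coarse 2-D map."""
    t_normal = text_embedding(NORMAL_PROMPTS, obj)
    t_abnormal = text_embedding(ABNORMAL_PROMPTS, obj)
    patches = patch_tokens(image_tensor.to(device))           # (1, num_patches, dim)
    logits = 100.0 * patches @ torch.stack([t_normal, t_abnormal], dim=-1)
    probs = logits.softmax(dim=-1)[..., 1]                    # abnormal-state probability
    side = int(probs.shape[1] ** 0.5)                         # e.g. 14x14 for ViT-B/16 at 224px
    return probs.reshape(1, 1, side, side)
```
In practice one would preprocess a PIL image with the returned preprocess transform, call anomaly_map on the resulting tensor, and bilinearly upsample the coarse patch map to the input resolution; the TTA step in the abstract would then refine such a map by fitting a lightweight adapter against pseudo-labels derived from it, which is omitted in this sketch.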
Related papers
- Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection [58.228940066769596]
We introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system.
Our methods process pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context.
Our approach significantly exploits the potential of vision-language joint anomaly detection and demonstrates performance comparable to current SOTA methods across various datasets.
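One plausible reading of "each image as a visual reference for the other" is a mutual nearest-neighbor comparison of patch features: a patch is scored as anomalous when the paired image contains no similar patch. The sketch below is a hedged illustration of that idea, not the paper's actual scoring rule, and it reuses the hypothetical patch_tokens helper from the sketch above.
```python
def mutual_reference_map(img_a, img_b):
    # patch_tokens() is the hypothetical helper defined in the sketch above;
    # it returns L2-normalized patch features of shape (1, N, d).
    fa, fb = patch_tokens(img_a), patch_tokens(img_b)
    sim_ab = fa @ fb.transpose(1, 2)            # pairwise cosine similarities, (1, N, N)
    score_a = 1.0 - sim_ab.max(dim=2).values    # a patch in A is anomalous if far from all of B
    score_b = 1.0 - sim_ab.max(dim=1).values    # and symmetrically for B against A
    side = int(score_a.shape[1] ** 0.5)
    return (score_a.reshape(1, 1, side, side),
            score_b.reshape(1, 1, side, side))
```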
arXiv Detail & Related papers (2024-05-08T03:13:20Z)
- Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection [18.414762007525137]
Large vision-language models (LVLMs) are proficient in deriving visual representations guided by natural language.
Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges.
We present ALFA, a training-free approach designed to address these challenges via a unified model.
arXiv Detail & Related papers (2024-04-15T10:42:22Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
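The mechanism named here, distilling local features from an EMA teacher, can be sketched in a few lines; the loss form, momentum value, and helper names below are illustrative assumptions rather than SILC's exact objective.
```python
# Hedged sketch of local-feature self-distillation from an EMA teacher.
import copy
import torch
import torch.nn.functional as F

def make_ema_teacher(student: torch.nn.Module) -> torch.nn.Module:
    """Frozen copy of the student that will be updated only by EMA."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """teacher <- momentum * teacher + (1 - momentum) * student."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def local_distill_loss(student_patches, teacher_patches):
    # Match per-patch features of a local view (student) to the corresponding
    # region of the global view (teacher) with a cosine objective.
    s = F.normalize(student_patches, dim=-1)
    t = F.normalize(teacher_patches, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()
```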
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR).
By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow.
Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z)
- CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [67.43527289422978]
We propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs.
We achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
arXiv Detail & Related papers (2023-10-02T17:58:52Z)
- ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation [20.57370550156505]
ReCLIP is a source-free domain adaptation method for vision-language models.
We demonstrate ReCLIP reduces the average error rate of CLIP from 30.17% to 25.06% on 22 image classification benchmarks.
arXiv Detail & Related papers (2023-08-04T18:11:40Z)
- A Closer Look at the Explainability of Contrastive Language-Image Pre-training [16.10032166963232]
Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks.
We identify several issues with its explainability that undermine its credibility and limit its capacity on related tasks.
We propose CLIP Surgery for reliable CAM, a method that allows surgery-like modifications to the inference architecture and features.
arXiv Detail & Related papers (2023-04-12T07:16:55Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)