ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection
- URL: http://arxiv.org/abs/2505.17692v2
- Date: Tue, 22 Jul 2025 14:34:53 GMT
- Title: ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection
- Authors: Ziteng Yang, Jingzehua Xu, Yanshu Li, Zepeng Li, Yeqiang Wang, Xinghui Li,
- Abstract summary: Zero-shot anomaly detection (ZSAD) aims to detect anomalies without any target domain training samples, relying solely on external auxiliary data.<n>Existing CLIP-based methods attempt to activate the model's ZSAD potential via handcrafted or static learnable prompts.<n>ViP$2$-CLIP fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts.
- Score: 2.622385361961154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot anomaly detection (ZSAD) aims to detect anomalies without any target domain training samples, relying solely on external auxiliary data. Existing CLIP-based methods attempt to activate the model's ZSAD potential via handcrafted or static learnable prompts. The former incur high engineering costs and limited semantic coverage, whereas the latter apply identical descriptions across diverse anomaly types, thus fail to adapt to complex variations. Furthermore, since CLIP is originally pretrained on large-scale classification tasks, its anomaly segmentation quality is highly sensitive to the exact wording of class names, severely constraining prompting strategies that depend on class labels. To address these challenges, we introduce ViP$^{2}$-CLIP. The key insight of ViP$^{2}$-CLIP is a Visual-Perception Prompting (ViP-Prompt) mechanism, which fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts, eliminating manual templates and class-name priors. This design enables our model to focus on precise abnormal regions, making it particularly valuable when category labels are ambiguous or privacy-constrained. Extensive experiments on 15 industrial and medical benchmarks demonstrate that ViP$^{2}$-CLIP achieves state-of-the-art performance and robust cross-domain generalization.
Related papers
- CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection [6.1568149026052374]
Conditional Prompt Synthesis (CoPS) is a novel framework that synthesizes dynamic prompts conditioned on visual features to enhance ZSAD performance.<n>CoPS surpasses state-of-the-art methods by 2.5% AUROC in both classification and segmentation across 13 industrial and medical datasets.
arXiv Detail & Related papers (2025-08-05T13:47:45Z) - IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain [40.584137588388245]
IQE-CLIP is an innovative framework for anomaly detection tasks in medical domain.<n>We introduce class-based prompting tokens and learnable prompting tokens for better adaptation of CLIP to the medical domain.<n>Our framework achieves state-of-the-art on both zero-shot and few-shot tasks.
arXiv Detail & Related papers (2025-06-12T14:23:06Z) - Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning [58.16354555208417]
Presentation Attack Detection and Face Forgery Detection are designed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes respectively.<n>Separate training of these two models makes them vulnerable to unknown attacks and burdens deployment environments.<n>We present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework (HiPTune) that adaptively explores multiple classification criteria from different semantic spaces.
arXiv Detail & Related papers (2025-05-19T16:35:45Z) - SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting [70.49268117587562]
We propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories.<n>During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories.
arXiv Detail & Related papers (2025-04-24T09:31:08Z) - GenCLIP: Generalizing CLIP Prompts for Zero-shot Anomaly Detection [13.67800822455087]
A key challenge in ZSAD is learning general prompts stably and utilizing them effectively.<n>We propose GenCLIP, a novel framework that learns and leverages general prompts more effectively.<n>We introduce a dual-branch inference strategy, where a vision-enhanced branch captures fine-grained category-specific features, while a query-only branch prioritizes generalization.
arXiv Detail & Related papers (2025-04-21T07:38:25Z) - GlocalCLIP: Object-agnostic Global-Local Prompt Learning for Zero-shot Anomaly Detection [5.530212768657544]
We introduce glocal contrastive learning to improve the complementary learning of global and local prompts.<n>The generalization performance of GlocalCLIP in ZSAD was demonstrated on 15 real-world datasets.
arXiv Detail & Related papers (2024-11-09T05:22:13Z) - AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection [14.916862007773341]
This study introduces AdaCLIP for the ZSAD task, leveraging a pre-trained vision-language model (VLM), CLIP.
AdaCLIP incorporates learnable prompts into CLIP and optimize them through training on auxiliary annotated anomaly detection data.
Experiments conducted across 14 real-world anomaly detection datasets from industrial and medical domains indicate that AdaCLIP outperforms other ZSAD methods.
arXiv Detail & Related papers (2024-07-22T16:52:37Z) - CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing [66.6712018832575]
Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the model's performance on unseen domains.
We make use of large-scale VLMs like CLIP and leverage the textual feature to dynamically adjust the classifier's weights for exploring generalizable visual features.
arXiv Detail & Related papers (2024-03-21T11:58:50Z) - Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot
Anomaly Localization [63.61093388441298]
Contrastive Language-Image Pre-training models have shown promising performance on zero-shot visual recognition tasks.
In this work, we propose AnoCLIP for zero-shot anomaly localization.
arXiv Detail & Related papers (2023-08-30T10:35:36Z) - Realistic Unsupervised CLIP Fine-tuning with Universal Entropy Optimization [101.08992036691673]
This paper explores a realistic unsupervised fine-tuning scenario, considering the presence of out-of-distribution samples from unknown classes.
In particular, we focus on simultaneously enhancing out-of-distribution detection and the recognition of instances associated with known classes.
We present a simple, efficient, and effective approach called Universal Entropy Optimization (UEO)
arXiv Detail & Related papers (2023-08-24T16:47:17Z) - VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video
Anomaly Detection [58.47940430618352]
We propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD)
VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP.
We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD.
arXiv Detail & Related papers (2023-08-22T14:58:36Z) - Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary
Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD)
During the adapting stage, we enable VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.