PromptMAD: Cross-Modal Prompting for Multi-Class Visual Anomaly Localization
- URL: http://arxiv.org/abs/2601.22492v1
- Date: Fri, 30 Jan 2026 03:04:06 GMT
- Title: PromptMAD: Cross-Modal Prompting for Multi-Class Visual Anomaly Localization
- Authors: Duncan McCain, Hossein Kashiani, Fatemeh Afghah
- Abstract summary: We propose PromptMAD, a cross-modal prompting framework for unsupervised visual anomaly detection and localization. Our method enriches visual reconstruction with semantic context, improving the detection of subtle and textural anomalies. Our architecture also includes a supervised segmentor that fuses multi-scale convolutional features with Transformer-based spatial attention.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual anomaly detection in multi-class settings poses significant challenges due to the diversity of object categories, the scarcity of anomalous examples, and the presence of camouflaged defects. In this paper, we propose PromptMAD, a cross-modal prompting framework for unsupervised visual anomaly detection and localization that integrates semantic guidance through vision-language alignment. By leveraging CLIP-encoded text prompts describing both normal and anomalous class-specific characteristics, our method enriches visual reconstruction with semantic context, improving the detection of subtle and textural anomalies. To further address the challenge of class imbalance at the pixel level, we incorporate the Focal loss, which emphasizes hard-to-detect anomalous regions during training. Our architecture also includes a supervised segmentor that fuses multi-scale convolutional features with Transformer-based spatial attention and diffusion iterative refinement, yielding precise and high-resolution anomaly maps. Extensive experiments on the MVTec-AD dataset demonstrate that our method achieves state-of-the-art pixel-level performance, improving mean AUC to 98.35% and AP to 66.54%, while maintaining efficiency across diverse categories.
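The abstract's use of the Focal loss to counter pixel-level class imbalance can be sketched as follows. This is a minimal NumPy illustration of the standard binary focal loss applied to an anomaly map, not the authors' implementation; the `alpha` and `gamma` defaults are illustrative assumptions.

```python
import numpy as np

def focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean pixel-wise binary focal loss over an anomaly map.

    probs   -- predicted anomaly probabilities in [0, 1], shape (H, W)
    targets -- ground-truth mask, 1 = anomalous pixel, 0 = normal
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    # p_t is the probability the model assigns to each pixel's true class.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    # The (1 - p_t)^gamma factor down-weights easy, well-classified pixels,
    # so the loss is dominated by hard-to-detect anomalous regions.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

With `gamma=0` this reduces to alpha-weighted cross-entropy; raising `gamma` shrinks the contribution of confidently correct pixels, which is the mechanism the abstract invokes for emphasizing hard anomalous regions.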
Related papers
- Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection [65.29550320117526]
We propose a novel framework named FineGrainedAD to improve anomaly localization performance. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings.
arXiv Detail & Related papers (2025-10-30T13:09:00Z)
- Saccadic Vision for Fine-Grained Visual Classification [10.681604440788854]
Fine-grained visual classification (FGVC) requires distinguishing between visually similar categories through subtle, localized features. Existing part-based methods rely on complex localization networks that learn mappings from pixel to sample space. We propose a two-stage process that first extracts peripheral features and generates a sample map. We employ contextualized selective attention to weigh the impact of each fixation patch before fusing peripheral and focus representations.
arXiv Detail & Related papers (2025-09-19T07:03:37Z)
- CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection [49.11819337853632]
Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types, and the scarcity of training data. We propose CLIPfusion, a method that leverages both discriminative and generative foundation models. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection.
arXiv Detail & Related papers (2025-06-13T13:30:15Z)
- Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z)
- Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detection [50.343419243749054]
Anomaly detection is critical in fields such as medical diagnostics and industrial defect detection. CLIP's coarse-grained image-text alignment limits localization and detection performance for fine-grained anomalies. Crane improves on the state-of-the-art in ZSAD by 2% to 28%, at both image and pixel levels, while remaining competitive in inference speed.
arXiv Detail & Related papers (2025-04-15T10:42:25Z)
- Patch-aware Vector Quantized Codebook Learning for Unsupervised Visual Defect Detection [4.081433571732692]
Unsupervised visual defect detection is critical in industrial applications. We propose a novel approach using an enhanced VQ-VAE framework optimized for unsupervised defect detection.
arXiv Detail & Related papers (2025-01-15T22:26:26Z)
- Change-Aware Siamese Network for Surface Defects Segmentation under Complex Background [0.6407952035735353]
We propose a change-aware Siamese network that solves defect segmentation within a change detection framework.
A novel multi-class balanced contrastive loss is introduced to guide the Transformer-based encoder.
The difference presented by a distance map is then skip-connected to the change-aware decoder to assist in locating both inter-class and out-of-class pixel-wise defects.
arXiv Detail & Related papers (2024-09-01T02:48:11Z)
- DiAD: A Diffusion-based Framework for Multi-class Anomaly Detection [55.48770333927732]
We propose a Diffusion-based Anomaly Detection (DiAD) framework for multi-class anomaly detection.
It consists of a pixel-space autoencoder, a latent-space Semantic-Guided (SG) network with a connection to Stable Diffusion's denoising network, and a feature-space pre-trained feature extractor.
Experiments on MVTec-AD and VisA datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-12-11T18:38:28Z)
- Multi-Class Anomaly Detection based on Regularized Discriminative Coupled hypersphere-based Feature Adaptation [85.15324009378344]
This paper introduces a new model by including class discriminative properties obtained by a modified Regularized Discriminative Variational Auto-Encoder (RD-VAE) in the feature extraction process.
The proposed Regularized Discriminative Coupled-hypersphere-based Feature Adaptation (RD-CFA) forms a solution for multi-class anomaly detection.
arXiv Detail & Related papers (2023-11-24T14:26:07Z)
- Improving Vision Anomaly Detection with the Guidance of Language Modality [64.53005837237754]
This paper tackles the challenges for vision modality from a multimodal point of view.
We propose Cross-modal Guidance (CMG) to tackle the redundant information issue and sparse space issue.
To learn a more compact latent space for the vision anomaly detector, the Cross-modal Linear Embedding (CMLE) component learns a correlation structure matrix from the language modality.
arXiv Detail & Related papers (2023-10-04T13:44:56Z)
- Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection [37.99031842449251]
Video anomaly detection under weak supervision presents significant challenges.
We present a weakly supervised anomaly detection framework that focuses on efficient context modeling and enhanced semantic discriminability.
Our approach significantly improves the detection accuracy of certain anomaly sub-classes, underscoring its practical value and efficacy.
arXiv Detail & Related papers (2023-06-26T06:45:16Z)
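A pattern shared by several of the CLIP-style methods above (PromptMAD, Crane, CLIPfusion) is scoring anomalies by comparing patch embeddings against text prompts for "normal" and "anomalous" appearance. The sketch below illustrates that scoring step only, with plain arrays standing in for real CLIP encoders; the function name, shapes, and temperature value are illustrative assumptions, not any paper's API.

```python
import numpy as np

def anomaly_map(patch_embs, text_normal, text_anom, temperature=0.07):
    """Score each image patch against normal/anomalous text embeddings.

    patch_embs  -- patch embeddings, shape (H, W, D)
    text_normal -- text embedding for the "normal" prompt, shape (D,)
    text_anom   -- text embedding for the "anomalous" prompt, shape (D,)
    Returns an (H, W) map of anomaly probabilities.
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    patches = norm(patch_embs)
    texts = norm(np.stack([text_normal, text_anom]))   # (2, D)
    # Cosine similarities between every patch and both prompts.
    logits = patches @ texts.T / temperature           # (H, W, 2)
    logits -= logits.max(axis=-1, keepdims=True)       # numerically stable softmax
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs[..., 1]                               # P(anomalous) per patch
```

In practice the map is upsampled to image resolution and, in methods like PromptMAD, fused with reconstruction-based cues rather than used alone.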
This list is automatically generated from the titles and abstracts of the papers in this site.