Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection
- URL: http://arxiv.org/abs/2404.09654v2
- Date: Tue, 10 Sep 2024 11:58:23 GMT
- Title: Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection
- Authors: Jiaqi Zhu, Shaofeng Cai, Fang Deng, Beng Chin Ooi, Junran Wu
- Abstract summary: Large vision-language models (LVLMs) are proficient in deriving visual representations guided by natural language.
Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges.
We present ALFA, a training-free approach designed to address these challenges via a unified model.
- Score: 18.414762007525137
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large vision-language models (LVLMs) are markedly proficient in deriving visual representations guided by natural language. Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges by pairing images with textual descriptions indicative of normal and abnormal conditions, referred to as anomaly prompts. However, existing approaches depend on static anomaly prompts that are prone to cross-semantic ambiguity, and prioritize global image-level representations over crucial local pixel-level image-to-text alignment that is necessary for accurate anomaly localization. In this paper, we present ALFA, a training-free approach designed to address these challenges via a unified model. We propose a run-time prompt adaptation strategy, which first generates informative anomaly prompts to leverage the capabilities of a large language model (LLM). This strategy is enhanced by a contextual scoring mechanism for per-image anomaly prompt adaptation and cross-semantic ambiguity mitigation. We further introduce a novel fine-grained aligner to fuse local pixel-level semantics for precise anomaly localization, by projecting the image-text alignment from global to local semantic spaces. Extensive evaluations on MVTec and VisA datasets confirm ALFA's effectiveness in harnessing the language potential for zero-shot VAD, achieving significant PRO improvements of 12.1% on MVTec and 8.9% on VisA compared to state-of-the-art approaches.
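To make the two ingredients of the abstract concrete, below is a minimal, illustrative sketch of zero-shot anomaly scoring with anomaly prompts and patch-level image-text alignment. It is not the authors' released implementation: it assumes that normal/abnormal prompt embeddings (e.g., prompt text obtained by querying an LLM and encoded with a CLIP-like text encoder) and per-patch image embeddings are already available, and all tensor names are placeholders.

```python
# Minimal sketch (not the ALFA implementation): score image patches against
# LLM-generated "normal" vs. "abnormal" anomaly prompts. Patch-level scoring
# gives a localization map; the max over patches gives an image-level score.
import torch
import torch.nn.functional as F

def anomaly_scores(patch_feats, normal_txt, abnormal_txt, temperature=0.07):
    """
    patch_feats : [N, D]  patch embeddings of one image (CLIP-like vision encoder)
    normal_txt  : [Kn, D] embeddings of prompts describing normal conditions
    abnormal_txt: [Ka, D] embeddings of prompts describing abnormal conditions
    Returns per-patch anomaly probabilities in [0, 1] and an image-level score.
    """
    p = F.normalize(patch_feats, dim=-1)
    n = F.normalize(normal_txt, dim=-1)
    a = F.normalize(abnormal_txt, dim=-1)

    # Similarity to the best-matching prompt on each side, per patch.
    sim_normal = (p @ n.T).max(dim=-1).values      # [N]
    sim_abnormal = (p @ a.T).max(dim=-1).values    # [N]

    # Softmax over the two hypotheses approximates P(abnormal | patch).
    logits = torch.stack([sim_normal, sim_abnormal], dim=-1) / temperature
    p_abnormal = logits.softmax(dim=-1)[..., 1]    # [N]
    return p_abnormal, p_abnormal.max()
```

A pixel-level anomaly map would then come from reshaping the per-patch scores to the patch grid and upsampling to the image resolution. ALFA's run-time prompt adaptation and contextual scoring, which generate and re-weight anomaly prompts per image, and its fine-grained aligner, which projects the alignment from global to local semantic space, sit on top of this basic scoring and are omitted here.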
Related papers
- VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection [19.79027968793026]
Zero-shot anomaly detection (ZSAD) recognizes and localizes anomalies in previously unseen objects.
Existing ZSAD methods are limited by closed-world settings, struggling to generalize to unseen defects with predefined prompts.
We propose a novel framework VMAD (Visual-enhanced MLLM Anomaly Detection) that enhances MLLM with visual-based IAD knowledge and fine-grained perception.
arXiv Detail & Related papers (2024-09-30T09:51:29Z)
- Human-Free Automated Prompting for Vision-Language Anomaly Detection: Prompt Optimization with Meta-guiding Prompt Scheme [19.732769780675977]
Pre-trained vision-language models (VLMs) are highly adaptable to various downstream tasks through few-shot learning.
Traditional methods depend on human-crafted prompts that require prior knowledge of specific anomaly types.
Our goal is to develop a human-free prompt-based anomaly detection framework that optimally learns prompts through data-driven methods.
arXiv Detail & Related papers (2024-06-26T09:29:05Z)
- Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection [58.228940066769596]
We introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system.
Our methods process pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context.
Our approach significantly exploits the potential of vision-language joint anomaly detection and demonstrates comparable performance with current SOTA methods across various datasets (a rough sketch of this mutual-reference scoring idea follows the related-papers list below).
arXiv Detail & Related papers (2024-05-08T03:13:20Z)
- Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress with advanced large vision-language (VL) models on the CIR task; however, they generally suffer from two main issues: a lack of labeled triplets for model training and the difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z)
- Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization [63.61093388441298]
Contrastive Language-Image Pre-training models have shown promising performance on zero-shot visual recognition tasks.
In this work, we propose AnoCLIP for zero-shot anomaly localization.
arXiv Detail & Related papers (2023-08-30T10:35:36Z)
- AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models [30.723122000372538]
AnomalyGPT is a novel industrial anomaly detection (IAD) approach based on Large Vision-Language Models (LVLMs).
We generate training data by simulating anomalous images and producing corresponding textual descriptions for each image.
AnomalyGPT achieves the state-of-the-art performance with an accuracy of 86.1%, an image-level AUC of 94.1%, and a pixel-level AUC of 95.3% on the MVTec-AD dataset.
arXiv Detail & Related papers (2023-08-29T15:02:53Z)
- Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition [92.6211155264297]
Vision models have gained increasing attention due to their simplicity and efficiency on the Scene Text Recognition (STR) task.
Recent vision models suffer from attention drift: the pure vision-based query often causes poor recognition, summarized in this paper as the linguistic insensitive drift (LID) problem.
We propose a Linguistic Perception Vision model (LPV), which explores the linguistic capability of vision models for accurate text recognition.
arXiv Detail & Related papers (2023-05-09T02:52:47Z)
- VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment [52.489874804051304]
VoLTA is a new vision-language pre-training paradigm that only utilizes image-caption data yet achieves fine-grained region-level image understanding.
VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training.
Experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA.
arXiv Detail & Related papers (2022-10-09T01:49:58Z)
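As a companion to the Dual-Image Enhanced CLIP entry above, here is a rough, hypothetical sketch of mutual-reference patch scoring: each image in a pair is scored against the other, so a patch with no close counterpart in its reference image is flagged as more likely anomalous. The function name and the assumption that patch embeddings are already extracted are illustrative; this is not that paper's actual scoring system.

```python
# Rough sketch (an assumption, not the Dual-Image Enhanced CLIP method):
# score each patch by its distance to the nearest patch of the paired image.
import torch
import torch.nn.functional as F

def mutual_reference_scores(patches_a, patches_b):
    """
    patches_a : [Na, D] patch embeddings of image A
    patches_b : [Nb, D] patch embeddings of image B (same object category)
    Returns per-patch anomaly scores for A and for B, defined as one minus
    the cosine similarity to the closest patch in the other image.
    """
    a = F.normalize(patches_a, dim=-1)
    b = F.normalize(patches_b, dim=-1)
    sim = a @ b.T                             # [Na, Nb] cosine similarities
    score_a = 1.0 - sim.max(dim=1).values     # [Na]: A scored against B
    score_b = 1.0 - sim.max(dim=0).values     # [Nb]: B scored against A
    return score_a, score_b
```

In a zero-shot pipeline, such visual-reference scores would typically be fused with text-prompt scores like those sketched after the abstract above, rather than used alone.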