AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models
- URL: http://arxiv.org/abs/2308.15366v4
- Date: Thu, 28 Dec 2023 08:22:14 GMT
- Title: AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models
- Authors: Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
- Abstract summary: AnomalyGPT is a novel IAD approach based on Large Vision-Language Models (LVLMs).
We generate training data by simulating anomalous images and producing corresponding textual descriptions for each image.
AnomalyGPT achieves state-of-the-art performance with an accuracy of 86.1%, an image-level AUC of 94.1%, and a pixel-level AUC of 95.3% on the MVTec-AD dataset.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Vision-Language Models (LVLMs) such as MiniGPT-4 and LLaVA have
demonstrated the capability of understanding images and achieved remarkable
performance in various visual tasks. Despite their strong abilities in
recognizing common objects due to extensive training datasets, they lack
specific domain knowledge and have a weaker understanding of localized details
within objects, which hinders their effectiveness in the Industrial Anomaly
Detection (IAD) task. On the other hand, most existing IAD methods only provide
anomaly scores and necessitate the manual setting of thresholds to distinguish
between normal and abnormal samples, which restricts their practical
implementation. In this paper, we explore the use of LVLMs to address the IAD
problem and propose AnomalyGPT, a novel LVLM-based IAD approach. We
generate training data by simulating anomalous images and producing
corresponding textual descriptions for each image. We also employ an image
decoder to provide fine-grained semantics and design a prompt learner to
fine-tune the LVLM using prompt embeddings. Our AnomalyGPT eliminates the need
for manual threshold adjustments and can therefore directly assess the presence
and locations of anomalies. Additionally, AnomalyGPT supports multi-turn dialogues
and exhibits impressive few-shot in-context learning capabilities. With only
one normal shot, AnomalyGPT achieves state-of-the-art performance with an
accuracy of 86.1%, an image-level AUC of 94.1%, and a pixel-level AUC of 95.3%
on the MVTec-AD dataset. Code is available at
https://github.com/CASIA-IVA-Lab/AnomalyGPT.
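
As a rough illustration of the data-simulation step described above, the sketch below pastes a random noise patch onto a normal image and pairs it with a pixel-level mask and a coarse textual description of the defect location, i.e. the kind of image-text pair the abstract refers to. This is a minimal sketch, not the authors' pipeline: the patch-based corruption, mask format, and description template are assumptions made for illustration; the actual implementation is in the linked repository.

```python
import numpy as np

def simulate_anomaly(normal_img, patch_frac=0.15, rng=None):
    """Toy anomaly simulation: corrupt a rectangular patch of a normal image
    and return the anomalous image, a pixel-level mask, and a coarse textual
    description of the defect location (illustrative only)."""
    rng = rng or np.random.default_rng()
    h, w = normal_img.shape[:2]
    ph, pw = max(1, int(h * patch_frac)), max(1, int(w * patch_frac))
    top = int(rng.integers(0, h - ph + 1))
    left = int(rng.integers(0, w - pw + 1))

    # Replace the chosen patch with random noise to mimic a surface defect.
    anomalous = normal_img.copy()
    anomalous[top:top + ph, left:left + pw] = rng.integers(
        0, 256, size=(ph, pw) + normal_img.shape[2:], dtype=normal_img.dtype)

    # Pixel-level ground-truth mask for the simulated defect region.
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[top:top + ph, left:left + pw] = 1

    # Coarse location phrase used to build the textual training target.
    vert = "top" if top + ph / 2 < h / 2 else "bottom"
    horiz = "left" if left + pw / 2 < w / 2 else "right"
    description = (f"Yes, there is an anomaly in the image, "
                   f"at the {vert} {horiz} of the image.")
    return anomalous, mask, description

if __name__ == "__main__":
    normal = np.full((224, 224, 3), 200, dtype=np.uint8)  # stand-in normal sample
    img, mask, text = simulate_anomaly(normal)
    print(text, "| defect pixels:", int(mask.sum()))
```

In the paper, such simulated image-text pairs, together with the image decoder's fine-grained features, are what the prompt learner and LVLM are fine-tuned on; the sketch covers only the pair-generation idea.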
Related papers
- Membership Inference Attacks against Large Vision-Language Models [40.996912464828696]
Large vision-language models (VLLMs) exhibit promising capabilities for processing multi-modal tasks across various application scenarios.
Their emergence also raises significant data security concerns, given the potential inclusion of sensitive information, such as private photos and medical records.
Detecting inappropriately used data in VLLMs remains a critical and unresolved issue.
arXiv Detail & Related papers (2024-11-05T08:35:08Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.992614129625274]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and Localization task.
It captures high-order correlations of forged images from diverse linguistic feature spaces.
It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z) - VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection [19.79027968793026]
Zero-shot anomaly detection (ZSAD) recognizes and localizes anomalies in previously unseen objects.
Existing ZSAD methods are limited by closed-world settings, struggling to handle unseen defects with predefined prompts.
We propose a novel framework VMAD (Visual-enhanced MLLM Anomaly Detection) that enhances MLLM with visual-based IAD knowledge and fine-grained perception.
arXiv Detail & Related papers (2024-09-30T09:51:29Z) - MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection [107.15164718585666]
We investigate the root cause of VLMs' biased prediction under the open vocabulary detection context.
Our observations lead to a simple yet effective paradigm, dubbed MarvelOVD, that generates significantly better training targets.
Our method outperforms other state-of-the-art approaches by significant margins.
arXiv Detail & Related papers (2024-07-31T09:23:57Z) - Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection [18.414762007525137]
Large vision-language models (LVLMs) are proficient in deriving visual representations guided by natural language.
Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges.
We present ALFA, a training-free approach designed to address these challenges via a unified model.
arXiv Detail & Related papers (2024-04-15T10:42:22Z) - Few-shot Online Anomaly Detection and Segmentation [29.693357653538474]
This paper focuses on addressing the challenging yet practical few-shot online anomaly detection and segmentation (FOADS) task.
Under the FOADS framework, models are trained on a few-shot normal dataset and then refined by leveraging unlabeled streaming data that contains both normal and abnormal samples.
In order to achieve improved performance with limited training samples, we employ multi-scale feature embedding extracted from a CNN pre-trained on ImageNet to obtain a robust representation.
arXiv Detail & Related papers (2024-03-27T02:24:00Z) - Raising the Bar of AI-generated Image Detection with CLIP [50.345365081177555]
The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images.
We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios.
arXiv Detail & Related papers (2023-11-30T21:11:20Z) - AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection [30.679012320439625]
AnomalyCLIP learns object-agnostic text prompts to capture generic normality and abnormality in an image.
It achieves superior zero-shot performance in detecting and segmenting anomalies in datasets of highly diverse class semantics.
arXiv Detail & Related papers (2023-10-29T10:03:49Z) - MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can mitigate the need of Vision Transformer networks for very large fully annotated datasets.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)