Related papers: GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection

URL: http://arxiv.org/abs/2311.02612v2
Date: Tue, 16 Apr 2024 11:35:37 GMT
Title: GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection
Authors: Jiangning Zhang, Haoyang He, Xuhai Chen, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lei Xie, Yong Liu
Abstract summary: This paper explores the potential of VQA-oriented GPT-4V in the popular visual Anomaly Detection (AD) task. It is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets.
Score: 51.43589678946244
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Multimodal Model (LMM) GPT-4V(ision) endows GPT-4 with visual grounding capabilities, making it possible to handle certain tasks through the Visual Question Answering (VQA) paradigm. This paper explores the potential of VQA-oriented GPT-4V in the recently popular visual Anomaly Detection (AD) task and is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets. Considering that this task requires both image-level and pixel-level evaluations, the proposed GPT-4V-AD framework contains three components: 1) Granular Region Division, 2) Prompt Designing, and 3) Text2Segmentation for easy quantitative evaluation, and we have made several different attempts for comparative analysis. The results show that GPT-4V can achieve certain results in the zero-shot AD task through a VQA paradigm, such as image-level 77.1/88.0 and pixel-level 68.0/76.6 AU-ROCs on the MVTec AD and VisA datasets, respectively. However, its performance still shows a gap compared to state-of-the-art zero-shot methods, e.g., WinCLIP and CLIP-AD, and further research is needed. This study provides a baseline reference for research on VQA-oriented LMMs in the zero-shot AD task, and we also outline several possible directions for future work. Code is available at https://github.com/zhangzjn/GPT-4V-AD.
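The three components named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper or its repository: a uniform grid stands in for the region division step, the prompt wording is invented, no model is called, and the answer string is a hypothetical reply. The function names (`divide_into_grid`, `build_prompt`, `parse_region_ids`, `text_to_mask`) are likewise hypothetical.

```python
import re


def divide_into_grid(height, width, rows, cols):
    """Assumed stand-in for region division: label each pixel with a grid cell id in 1..rows*cols."""
    labels = []
    for y in range(height):
        r = y * rows // height  # grid row index for this pixel row
        labels.append([r * cols + (x * cols // width) + 1 for x in range(width)])
    return labels


def build_prompt(num_regions):
    """Hypothetical VQA-style prompt asking for anomalous region numbers."""
    return (
        f"The image is divided into {num_regions} numbered regions. "
        "List the numbers of the regions that contain anomalies, "
        "or answer 'none' if the object looks normal."
    )


def parse_region_ids(answer, num_regions):
    """Extract integer region ids from free text, dropping out-of-range numbers."""
    ids = {int(tok) for tok in re.findall(r"\d+", answer)}
    return {i for i in ids if 1 <= i <= num_regions}


def text_to_mask(labels, anomalous_ids):
    """Turn the textual answer into a binary pixel mask (1 = flagged as anomalous)."""
    return [[1 if rid in anomalous_ids else 0 for rid in row] for row in labels]


labels = divide_into_grid(height=4, width=4, rows=2, cols=2)
prompt = build_prompt(num_regions=4)
answer = "Regions 2 and 4 look anomalous."  # hypothetical model reply
mask = text_to_mask(labels, parse_region_ids(answer, num_regions=4))
for row in mask:
    print(row)
```

A binary mask of this kind is what makes pixel-level scoring possible from a purely textual answer; how the actual framework forms regions and scores them should be checked against the linked repository.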

Related papers

Can ChatGPT Perform Image Splicing Detection? A Preliminary Study [0.0]
Multimodal Large Language Models (MLLMs) like GPT-4V are capable of reasoning across text and image modalities. We evaluate GPT-4V using three prompting strategies: Zero-Shot (ZS), Few-Shot (FS), and Chain-of-Thought (CoT). Our results show that GPT-4V achieves competitive detection performance in zero-shot settings (more than 85% accuracy).
arXiv Detail & Related papers (2025-05-22T13:53:53Z)
A Unified Agentic Framework for Evaluating Conditional Image Generation [66.25099219134441]
Conditional image generation has gained significant attention for its ability to personalize content. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks.
arXiv Detail & Related papers (2025-04-09T17:04:14Z)
Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI [0.6278186810520364]
Most qualitative analysis and explanation of image data has been conducted by human researchers, without machine-based automation. The recent development of Visual Question Answering (VQA) techniques is yielding usable visual language models. This paper aims to introduce VQA for educational studies so that it provides a milestone for educational research methodology.
arXiv Detail & Related papers (2024-05-12T05:05:31Z)
An Evaluation of GPT-4V and Gemini in Online VQA [31.77015255871848]
We evaluate two state-of-the-art LMMs, GPT-4V and Gemini, on a new visual question answering dataset. We conduct fine-grained analysis by generating seven types of metadata for nearly 2,000 visual questions. Our zero-shot performance analysis highlights the types of questions that are most challenging for both models.
arXiv Detail & Related papers (2023-12-17T07:38:43Z)
Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection [128.40330044868293]
Vision Transformer (ViT), showcasing a more straightforward architecture, has proven effective in multiple domains. ViTAD achieves state-of-the-art results and efficiency on MVTec AD, VisA, and Uni-Medical datasets.
arXiv Detail & Related papers (2023-12-12T18:28:59Z)
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? [82.40761196684524]
This paper centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks. We conduct extensive experiments to evaluate GPT-4's performance across images, videos, and point clouds. Our findings show that GPT-4, enhanced with rich linguistic descriptions, significantly improves zero-shot recognition.
arXiv Detail & Related papers (2023-11-27T11:29:10Z)
Exploring Recommendation Capabilities of GPT-4V(ision): A Preliminary Case Study [26.17177931611486]
We present a preliminary case study investigating the recommendation capabilities of GPT-4V(ision), a recently released LMM by OpenAI. We employ a series of qualitative test samples spanning multiple domains to assess the quality of GPT-4V's responses within recommendation scenarios. We have also identified some limitations in using GPT-4V for recommendations, including a tendency to provide similar responses when given similar inputs.
arXiv Detail & Related papers (2023-11-07T18:39:10Z)
GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translations and multi-images to text alignment. Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z)
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V [103.68138147783614]
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models. We employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions, and overlay these regions with a set of marks. Using the marked image as input, GPT-4V can answer the questions that require visual grounding.
arXiv Detail & Related papers (2023-10-17T17:51:31Z)
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [121.42924593374127]
We analyze the latest model, GPT-4V, to deepen the understanding of LMMs. GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system. GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods.
arXiv Detail & Related papers (2023-09-29T17:34:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.