Myriad: Large Multimodal Model by Applying Vision Experts for Industrial
Anomaly Detection
- URL: http://arxiv.org/abs/2310.19070v2
- Date: Wed, 1 Nov 2023 03:50:52 GMT
- Title: Myriad: Large Multimodal Model by Applying Vision Experts for Industrial
Anomaly Detection
- Authors: Yuanze Li, Haolin Wang, Shihao Yuan, Ming Liu, Debin Zhao, Yiwen Guo,
Chen Xu, Guangming Shi, Wangmeng Zuo
- Abstract summary: We propose a novel large multimodal model by applying vision experts for industrial anomaly detection (dubbed Myriad).
Specifically, we adopt MiniGPT-4 as the base LMM and design an Expert Perception module to embed the prior knowledge from vision experts as tokens that are intelligible to Large Language Models (LLMs).
To compensate for the errors and confusions of vision experts, we introduce a domain adapter to bridge the visual representation gaps between generic and industrial images.
- Score: 89.49244928440221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing industrial anomaly detection (IAD) methods predict anomaly scores for both anomaly detection and localization. However, they struggle to conduct multi-turn dialogs or provide detailed descriptions of anomaly regions, e.g., the color, shape, and category of industrial anomalies. Recently, large multimodal (i.e., vision and language) models (LMMs) have shown impressive perception abilities on multiple vision tasks such as image captioning, visual understanding, and visual reasoning, making them a competitive choice for more comprehensible anomaly detection. However, knowledge about anomaly detection is absent from existing general-purpose LMMs, while training an LMM specifically for anomaly detection requires a tremendous amount of annotated data and massive computation resources. In this paper, we propose a novel large multimodal model that applies vision experts to industrial anomaly detection (dubbed Myriad), which yields definite anomaly predictions and high-quality anomaly descriptions. Specifically, we adopt MiniGPT-4 as the base LMM and design an Expert Perception module to embed the prior knowledge from vision experts as tokens that are intelligible to Large Language Models (LLMs). To compensate for errors and confusion from the vision experts, we introduce a domain adapter to bridge the visual representation gap between generic and industrial images. Furthermore, we propose a Vision Expert Instructor, which enables the Q-Former to generate IAD domain vision-language tokens according to the vision experts' prior knowledge. Extensive experiments on the MVTec-AD and VisA benchmarks demonstrate that our proposed method not only performs favorably against state-of-the-art methods under the 1-class and few-shot settings, but also provides definite anomaly predictions along with detailed descriptions in the IAD domain.
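To make the Expert Perception idea concrete, below is a minimal sketch of how a vision expert's anomaly map might be embedded as tokens an LLM can consume. The module name, patch size, and hidden dimension (ExpertPerception, patch_size=16, llm_dim=4096) are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch: projecting a vision expert's anomaly map into a
# sequence of LLM-intelligible tokens. Names and dimensions are
# assumptions for illustration, not Myriad's actual code.
import torch
import torch.nn as nn

class ExpertPerception(nn.Module):
    """Project a single-channel anomaly map from a frozen vision expert
    into a sequence of embeddings with the LLM's hidden size."""

    def __init__(self, patch_size: int = 16, llm_dim: int = 4096):
        super().__init__()
        # Patchify the anomaly map: one embedding per spatial patch.
        self.patch_embed = nn.Conv2d(
            in_channels=1, out_channels=llm_dim,
            kernel_size=patch_size, stride=patch_size,
        )
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, anomaly_map: torch.Tensor) -> torch.Tensor:
        # anomaly_map: (B, 1, H, W) scores in [0, 1] from the expert.
        x = self.patch_embed(anomaly_map)   # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)    # (B, N, D) token sequence
        return self.norm(x)                 # ready for the LLM input

# Usage: a 224x224 anomaly map becomes 14x14 = 196 expert tokens.
tokens = ExpertPerception()(torch.rand(1, 1, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 4096])
```

In the full pipeline described above, such expert tokens would sit alongside the Q-Former's vision-language tokens and the textual instruction at the LLM's input, while the domain adapter would absorb the generic-to-industrial distribution shift before the experts' evidence is encoded.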
Related papers
- VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection [19.79027968793026]
Zero-shot anomaly detection (ZSAD) recognizes and localizes anomalies in previously unseen objects.
Existing ZSAD methods are limited by closed-world settings, struggling to generalize to unseen defects with predefined prompts.
We propose VMAD (Visual-enhanced MLLM Anomaly Detection), a novel framework that enhances the MLLM with visual-based IAD knowledge and fine-grained perception.
arXiv Detail & Related papers (2024-09-30T09:51:29Z)
- Vision-Language Models Assisted Unsupervised Video Anomaly Detection [3.1095294567873606]
Anomaly samples present significant challenges for unsupervised learning methods.
Our method employs a cross-modal pre-trained model that leverages the inferential capabilities of large language models.
By mapping high-dimensional visual features to low-dimensional semantic ones, our method significantly enhances the interpretability of unsupervised anomaly detection.
arXiv Detail & Related papers (2024-09-21T11:48:54Z)
- VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos.
Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models.
We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmark and find that most of the models have difficulty identifying the subtle anomalies effectively.
arXiv Detail & Related papers (2024-06-14T17:59:01Z)
- A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper introduces ADer, a comprehensive visual anomaly detection benchmark built as a modular framework that is extensible to new methods.
The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics.
We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z)
- Customizing Visual-Language Foundation Models for Multi-modal Anomaly Detection and Reasoning [3.2331030725755645]
We develop a generic anomaly detection model applicable across multiple scenarios.
Our approach considers multi-modal prompt types, including task descriptions, class context, normality rules, and reference images.
Our preliminary studies demonstrate that combining visual and language prompts as conditions for customizing the models enhances anomaly detection performance.
arXiv Detail & Related papers (2024-03-17T04:30:57Z)
- Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection [59.41026558455904]
We focus on multi-modal anomaly detection. Specifically, we investigate early multi-modal approaches that attempted to utilize models pre-trained on large-scale visual datasets.
We propose a Local-to-global Self-supervised Feature Adaptation (LSFA) method to fine-tune the adapters and learn task-oriented representations for anomaly detection.
arXiv Detail & Related papers (2024-01-06T07:30:41Z)
- Open-Vocabulary Video Anomaly Detection [57.552523669351636]
Video anomaly detection (VAD) with weak supervision has achieved remarkable performance by utilizing video-level labels to discriminate whether a video frame is normal or abnormal.
Recent studies attempt to tackle a more realistic setting, open-set VAD, which aims to detect unseen anomalies given seen anomalies and normal videos.
This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD), in which we aim to leverage pre-trained large models to detect and categorize seen and unseen anomalies.
arXiv Detail & Related papers (2023-11-13T02:54:17Z)
- Prototypical Residual Networks for Anomaly Detection and Localization [80.5730594002466]
We propose a framework called Prototypical Residual Network (PRN).
PRN learns feature residuals of varying scales and sizes between anomalous and normal patterns to accurately reconstruct the segmentation maps of anomalous regions (a toy sketch of this residual idea follows the list).
We present a variety of anomaly generation strategies that consider both seen and unseen appearance variance to enlarge and diversify anomalies.
arXiv Detail & Related papers (2022-12-05T05:03:46Z)
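As a rough illustration of the feature-residual idea just summarized, the sketch below contrasts test features with normal prototype features at two scales. The prototype construction, distance measure, and fusion are assumptions for illustration, not PRN's actual design.

```python
# Toy illustration of feature residuals between test features and
# normal prototypes, the core idea summarized for PRN above. The
# backbone, prototypes, and fusion scheme are assumptions.
import torch
import torch.nn.functional as F

def residual_maps(feats, prototypes):
    """feats: (B, C, H, W) test features; prototypes: (K, C, H, W)
    features of normal samples. Returns per-pixel residuals against
    the nearest normal prototype."""
    residuals = []
    for b in range(feats.shape[0]):
        # L2 distance to each prototype, per spatial location.
        d = ((prototypes - feats[b:b+1]) ** 2).sum(dim=1)   # (K, H, W)
        k = d.flatten(1).sum(dim=1).argmin()                # nearest prototype
        residuals.append((feats[b] - prototypes[k]).abs())  # (C, H, W)
    return torch.stack(residuals)                           # (B, C, H, W)

# Multi-scale: compare at two resolutions and fuse by upsampling.
feats = torch.randn(2, 64, 32, 32)
protos = torch.randn(10, 64, 32, 32)
r_full = residual_maps(feats, protos)
r_half = residual_maps(F.avg_pool2d(feats, 2), F.avg_pool2d(protos, 2))
fused = r_full + F.interpolate(r_half, scale_factor=2, mode="bilinear")
print(fused.shape)  # torch.Size([2, 64, 32, 32])
```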