Myriad: Large Multimodal Model by Applying Vision Experts for Industrial
Anomaly Detection
- URL: http://arxiv.org/abs/2310.19070v2
- Date: Wed, 1 Nov 2023 03:50:52 GMT
- Title: Myriad: Large Multimodal Model by Applying Vision Experts for Industrial
Anomaly Detection
- Authors: Yuanze Li, Haolin Wang, Shihao Yuan, Ming Liu, Debin Zhao, Yiwen Guo,
Chen Xu, Guangming Shi, Wangmeng Zuo
- Abstract summary: We propose a novel large multi-modal model by applying vision experts for industrial anomaly detection (dubbed Myriad)
Specifically, we adopt MiniGPT-4 as the base LMM and design an Expert Perception module to embed the prior knowledge from vision experts as tokens which are intelligible to Large Language Models (LLMs)
To compensate for the errors and confusions of vision experts, we introduce a domain adapter to bridge the visual representation gaps between generic and industrial images.
- Score: 89.49244928440221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing industrial anomaly detection (IAD) methods predict anomaly scores
for both anomaly detection and localization. However, they struggle to perform
a multi-turn dialog and detailed descriptions for anomaly regions, e.g., color,
shape, and categories of industrial anomalies. Recently, large multimodal
(i.e., vision and language) models (LMMs) have shown eminent perception
abilities on multiple vision tasks such as image captioning, visual
understanding, visual reasoning, etc., making it a competitive potential choice
for more comprehensible anomaly detection. However, the knowledge about anomaly
detection is absent in existing general LMMs, while training a specific LMM for
anomaly detection requires a tremendous amount of annotated data and massive
computation resources. In this paper, we propose a novel large multi-modal
model by applying vision experts for industrial anomaly detection (dubbed
Myriad), which leads to definite anomaly detection and high-quality anomaly
description. Specifically, we adopt MiniGPT-4 as the base LMM and design an
Expert Perception module to embed the prior knowledge from vision experts as
tokens which are intelligible to Large Language Models (LLMs). To compensate
for the errors and confusions of vision experts, we introduce a domain adapter
to bridge the visual representation gaps between generic and industrial images.
Furthermore, we propose a Vision Expert Instructor, which enables the Q-Former
to generate IAD domain vision-language tokens according to vision expert prior.
Extensive experiments on MVTec-AD and VisA benchmarks demonstrate that our
proposed method not only performs favorably against state-of-the-art methods
under the 1-class and few-shot settings, but also provide definite anomaly
prediction along with detailed descriptions in IAD domain.
Related papers
- Can Multimodal Large Language Models be Guided to Improve Industrial Anomaly Detection? [5.979778557940213]
Traditional industrial anomaly detection models often struggle with flexibility and adaptability.
Recent advancements in Multimodal Large Language Models (MLLMs) hold promise for overcoming these limitations.
We propose Echo, a novel multi-expert framework designed to enhance MLLM performance for IAD.
arXiv Detail & Related papers (2025-01-27T05:41:10Z) - Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding.
EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality.
Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z) - Chimera: Improving Generalist Model with Domain-Specific Experts [35.706585190958634]
We introduce a scalable and low-cost multi-modal pipeline designed to boost the ability of existing LMMs with domain-specific experts.
Specifically, we design a progressive training strategy to integrate features from expert models into the input of a generalist LMM.
This results in a versatile model that excels across the chart, table, math, and document domains.
arXiv Detail & Related papers (2024-12-08T16:10:42Z) - VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection [19.79027968793026]
Zero-shot anomaly detection (ZSAD) recognizes and localizes anomalies in previously unseen objects.
Existing ZSAD methods are limited by closed-world settings, struggling to unseen defects with predefined prompts.
We propose a novel framework VMAD (Visual-enhanced MLLM Anomaly Detection) that enhances MLLM with visual-based IAD knowledge and fine-grained perception.
arXiv Detail & Related papers (2024-09-30T09:51:29Z) - VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents [50.12414817737912]
Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents.
Existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments.
VisualAgentBench (VAB) is a pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents.
arXiv Detail & Related papers (2024-08-12T17:44:17Z) - Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models [58.58594658683919]
Large multimodal models (LMMs) have shown transformative potential across various research tasks.
Our findings indicate LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs.
We propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
arXiv Detail & Related papers (2024-05-24T16:26:56Z) - Large Language Models can Deliver Accurate and Interpretable Time Series Anomaly Detection [34.40206965758026]
Time series anomaly detection (TSAD) plays a crucial role in various industries by identifying atypical patterns that deviate from standard trends.
Traditional TSAD models, which often rely on deep learning, require extensive training data and operate as black boxes.
We propose LLMAD, a novel TSAD method that employs Large Language Models (LLMs) to deliver accurate and interpretable TSAD results.
arXiv Detail & Related papers (2024-05-24T09:07:02Z) - Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models [87.47400128150032]
We propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement.
Lumen first promotes fine-grained vision-language concept alignment.
Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders.
arXiv Detail & Related papers (2024-03-12T04:13:45Z) - MinT: Boosting Generalization in Mathematical Reasoning via Multi-View
Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs)
We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles.
Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.