Fugu-MT Paper Translation (Abstract): IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection

Paper summary: IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection

arxiv url: http://arxiv.org/abs/2510.16036v1
Date: Thu, 16 Oct 2025 02:48:05 GMT
Status: translation complete
Last updated in system: 2025-10-25 00:56:38.79308
Title: IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection
Authors: Zewen Li, Zitong Yu, Qilang Ye, Weicheng Xie, Wei Zhuo, Linlin Shen
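Abstract summary: This paper explores combining rich text semantics with both image-level and pixel-level information. We propose IAD-GPT, a new MLLM-based paradigm for industrial anomaly detection. Experiments on the MVTec-AD and VisA datasets demonstrate state-of-the-art performance.
Reference score (independently computed attention score): 70.02774285130238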
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The robust causal capability of Multimodal Large Language Models (MLLMs) holds potential for detecting defective objects in Industrial Anomaly Detection (IAD). However, most traditional IAD methods cannot provide multi-turn human-machine dialogue or detailed descriptions, such as the color of objects, the shape of an anomaly, or the specific type of anomaly. At the same time, methods based on large pre-trained models have not fully exploited the ability of large models in anomaly detection tasks. In this paper, we explore the combination of rich text semantics with both image-level and pixel-level information from images and propose IAD-GPT, a novel paradigm based on MLLMs for IAD. We employ an Abnormal Prompt Generator (APG) to generate detailed anomaly prompts for specific objects. These specific prompts from the large language model (LLM) are used to activate the detection and segmentation functions of the pre-trained visual-language model (i.e., CLIP). To enhance the visual grounding ability of MLLMs, we propose a Text-Guided Enhancer, in which image features interact with normal and abnormal text prompts to dynamically select enhancement pathways. This enables language models to focus on specific aspects of visual data, improving their ability to accurately interpret and respond to anomalies within images. Moreover, we design a Multi-Mask Fusion module to incorporate masks as expert knowledge, which enhances the LLM's perception of pixel-level anomalies. Extensive experiments on the MVTec-AD and VisA datasets demonstrate state-of-the-art performance on self-supervised and few-shot anomaly detection and segmentation tasks. The code is available at https://github.com/LiZeWen1225/IAD-GPT.
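The abstract says that normal and abnormal text prompts are used to drive CLIP's detection and segmentation. A minimal sketch of the general idea follows. It is not the authors' implementation: it compares each image patch embedding with one "normal" and one "abnormal" text embedding by cosine similarity and applies a softmax to get a per-patch anomaly probability. All function names, the toy 2-dimensional embeddings, and the temperature value are assumptions made for illustration; a real system would take the embeddings from a CLIP image and text encoder.

```python
import math

def normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def anomaly_probability(feature, normal_text, abnormal_text, temperature=0.07):
    """Softmax over the similarities to the normal and abnormal prompts.

    The temperature is an assumed hyperparameter, not taken from the paper.
    """
    s_normal = cosine(feature, normal_text) / temperature
    s_abnormal = cosine(feature, abnormal_text) / temperature
    m = max(s_normal, s_abnormal)  # subtract the max for numerical stability
    e_normal = math.exp(s_normal - m)
    e_abnormal = math.exp(s_abnormal - m)
    return e_abnormal / (e_normal + e_abnormal)

def anomaly_map(patch_features, normal_text, abnormal_text):
    """Per-patch anomaly probabilities for a 2D grid of patch embeddings."""
    return [
        [anomaly_probability(p, normal_text, abnormal_text) for p in row]
        for row in patch_features
    ]

def image_score(amap):
    """Image-level score: the most anomalous patch in the map."""
    return max(max(row) for row in amap)

# Toy, hand-made embeddings standing in for CLIP outputs (hypothetical values).
normal_text = [1.0, 0.0]      # e.g. embedding of "a photo of a flawless object"
abnormal_text = [0.0, 1.0]    # e.g. embedding of "a photo of a damaged object"
patches = [
    [[1.0, 0.1], [0.9, 0.2]],
    [[0.1, 1.0], [1.0, 0.0]],  # the lower-left patch points toward "abnormal"
]

amap = anomaly_map(patches, normal_text, abnormal_text)
score = image_score(amap)
```

In this toy grid the lower-left patch receives the highest anomaly probability, so it also determines the image-level score. Thresholding `amap` would give a coarse segmentation mask of the kind that a mask-fusion stage could consume.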
