Fugu-MT Paper Translation (Abstract): AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison

Paper overview: AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison

arxiv url: http://arxiv.org/abs/2603.13779v1
Date: Sat, 14 Mar 2026 06:14:44 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.401498
Title: AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison
Title (reference translation): AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison
Authors: Xi Jiang, Yue Guo, Jian Li, Yong Liu, Bin-Bin Gao, Hanqiu Deng, Jun Liu, Heng Zhao, Chengjie Wang, Feng Zheng
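Abstract summary: We propose AD-Copilot, an interactive MLLM specialized for industrial anomaly detection (IAD). We first design a novel data pipeline to mine inspection knowledge from sparsely labeled industrial images. We then generate precise samples for captioning, VQA, and defect localization, yielding Chat-AD, a large-scale multimodal dataset rich in semantic signals for IAD. Experiments show that AD-Copilot achieves 82.3% accuracy on the MMAD benchmark, outperforming all other models without any data leakage.
Reference score (attention level, computed independently by this site): 89.0720931534819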
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive success in natural visual understanding, yet they consistently underperform in industrial anomaly detection (IAD). This is because MLLMs trained mostly on general web data differ significantly from industrial images. Moreover, they encode each image independently and can only compare images in the language space, making them insensitive to subtle visual differences that are key to IAD. To tackle these issues, we present AD-Copilot, an interactive MLLM specialized for IAD via visual in-context comparison. We first design a novel data curation pipeline to mine inspection knowledge from sparsely labeled industrial images and generate precise samples for captioning, VQA, and defect localization, yielding a large-scale multimodal dataset Chat-AD rich in semantic signals for IAD. On this foundation, AD-Copilot incorporates a novel Comparison Encoder that employs cross-attention between paired image features to enhance multi-image fine-grained perception, and is trained with a multi-stage strategy that incorporates domain knowledge and gradually enhances IAD skills. In addition, we introduce MMAD-BBox, an extended benchmark for anomaly localization with bounding-box-based evaluation. The experiments show that AD-Copilot achieves 82.3% accuracy on the MMAD benchmark, outperforming all other models without any data leakage. In the MMAD-BBox test, it achieves a maximum improvement of $3.35\times$ over the baseline. AD-Copilot also exhibits excellent generalization of its performance gains across other specialized and general-purpose benchmarks. Remarkably, AD-Copilot surpasses human expert-level performance on several IAD tasks, demonstrating its potential as a reliable assistant for real-world industrial inspection. All datasets and models will be released for the broader benefit of the community.
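Abstract (reference translation): Multimodal Large Language Models (MLLMs) have achieved remarkable success in natural visual understanding, yet they consistently underperform in industrial anomaly detection (IAD). This is because MLLMs are trained mostly on general web data, which differs significantly from industrial images. Moreover, they encode each image independently and can only compare images in the language space, which makes them insensitive to the subtle visual differences that are key to IAD. To address these issues, we propose AD-Copilot, an interactive MLLM specialized for IAD via visual in-context comparison. We first design a novel data curation pipeline that mines inspection knowledge from sparsely labeled industrial images and generates precise samples for captioning, VQA, and defect localization, yielding Chat-AD, a large-scale multimodal dataset rich in semantic signals for IAD. On this foundation, AD-Copilot incorporates a novel Comparison Encoder that uses cross-attention between paired image features to enhance fine-grained multi-image perception, and it is trained with a multi-stage strategy that incorporates domain knowledge and gradually strengthens IAD skills. In addition, we introduce MMAD-BBox, an extended benchmark for anomaly localization with bounding-box-based evaluation. Experiments show that AD-Copilot achieves 82.3% accuracy on the MMAD benchmark, outperforming all other models without any data leakage. On the MMAD-BBox test, it achieves a maximum improvement of $3.35\times$ over the baseline. AD-Copilot's performance gains also generalize well to other specialized and general-purpose benchmarks. Notably, AD-Copilot surpasses human expert-level performance on several IAD tasks, demonstrating its potential as a reliable assistant for real-world industrial inspection. All datasets and models will be released for the broader benefit of the community.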
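
The abstract describes a Comparison Encoder that applies cross-attention between paired image features. The following is a minimal sketch of that general idea only, not the authors' implementation: it assumes each image is already reduced to a list of patch feature vectors, uses plain scaled dot-product attention with no learned projections, and the names `cross_attention` and `difference_map` are hypothetical helpers introduced here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_feats, ref_feats):
    """Let each query patch attend over all reference patches.

    query_feats, ref_feats: lists of equal-length feature vectors.
    Returns one attended reference vector per query patch.
    Assumption: no learned query/key/value projections are applied;
    a trained encoder would normally include them.
    """
    dim = len(query_feats[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in query_feats:
        # Scaled dot-product similarity between this patch and every reference patch.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in ref_feats]
        weights = softmax(scores)
        # Weighted average of the reference patches.
        out = [sum(w * k[d] for w, k in zip(weights, ref_feats)) for d in range(dim)]
        attended.append(out)
    return attended


def difference_map(query_feats, ref_feats):
    """Per-patch L2 distance between a query patch and its attended reference."""
    attended = cross_attention(query_feats, ref_feats)
    return [
        math.sqrt(sum((qi - ai) ** 2 for qi, ai in zip(q, a)))
        for q, a in zip(query_feats, attended)
    ]


if __name__ == "__main__":
    reference = [[1.0, 0.0], [0.0, 1.0]]   # patches of a defect-free image
    query = [[1.0, 0.0], [5.0, 5.0]]       # second patch is far from any reference patch
    print(difference_map(query, reference))
```

In this toy setup, a patch that resembles nothing in the reference image ends up with a larger distance to its attended reference than a patch that matches one, which is the kind of feature-space comparison signal the abstract contrasts with comparing images only in the language space.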

Related papers