Fugu-MT Paper Translation (Abstract): Zero-Shot Learning in Industrial Scenarios: New Large-Scale Benchmark, Challenges and Baseline

Paper overview: Zero-Shot Learning in Industrial Scenarios: New Large-Scale Benchmark, Challenges and Baseline

arxiv url: http://arxiv.org/abs/2606.07965v1
Date: Sat, 06 Jun 2026 03:48:12 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:05.588464
Title: Zero-Shot Learning in Industrial Scenarios: New Large-Scale Benchmark, Challenges and Baseline
Title (reference translation): Zero-Shot Learning in Industrial Scenarios - New Large-Scale Benchmark, Challenges, and Baseline
Authors: Zekai Zhang, Qinghui Chen, Maomao Xiong, Shijiao Ding, Zhanzhi Su, Xinjie Yao, Yiming Sun, Cong Bai, Jinglin Zhang
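Abstract summary: This paper proposes an open industrial dataset and a Refined Text-Visual Prompt (RTVP) for zero-shot industrial defect detection. MMIO is the first large-scale multi-scene pre-training dataset for industrial zero-shot learning. RTVP automatically generates visual prompts directly from images and considers the interactions between text and visual prompts.
Reference score (independently calculated attention score): 28.249460268707978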
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Visual Language Models (LVLMs) have achieved remarkable success in vision tasks. However, the significant differences between industrial and natural scenes make applying LVLMs challenging. Existing LVLMs rely on user-provided prompts to segment objects, which often leads to suboptimal performance because irrelevant pixels are included. In addition, the scarcity of data has left the application of LVLMs in industrial scenarios unexplored. To fill this gap, this paper proposes an open industrial dataset and a Refined Text-Visual Prompt (RTVP) for zero-shot industrial defect detection. First, this paper constructs the Multi-Modal Industrial Open Dataset (MMIO), which contains 80K+ samples spanning diverse industrial categories, including 6 super categories and 18 subcategories. MMIO is the first large-scale multi-scene pre-training dataset for industrial zero-shot learning and provides valuable training data for open models in future industrial scenarios. Based on MMIO, this paper provides an RTVP specifically for industrial zero-shot tasks. RTVP has two significant advantages. First, this paper designs an expert-guided large-model domain adaptation mechanism and an industrial zero-shot method based on Mobile-SAM, which enhances the generalization ability of large models in industrial scenarios. Second, RTVP automatically generates visual prompts directly from images and considers the text-visual prompt interactions ignored by previous LVLMs, improving visual and textual content understanding. RTVP achieves SOTA with 42.2% and 24.7% AP in the zero-shot and closed scenes of MMIO.
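Abstract (reference translation): Large vision-language models (LVLMs) have achieved remarkable success in vision tasks. However, the marked differences between industrial and natural scenes make LVLMs difficult to apply. Existing LVLMs rely on user-provided prompts to segment objects. This often results in suboptimal performance because irrelevant pixels are included. In addition, owing to the scarcity of data, the application of LVLMs in industrial scenarios also remains unexplored. To fill this gap, we propose an open industrial dataset and a Refined Text-Visual Prompt (RTVP) for zero-shot industrial defect detection. First, we construct MMIO (Multi-Modal Industrial Open Dataset), which contains more than 80K samples. MMIO covers a variety of industrial categories, including 6 super categories and 18 subcategories. MMIO is the first large-scale multi-scene pre-training dataset for industrial zero-shot learning and provides valuable training data for open models in future industrial scenarios. Based on MMIO, we provide an RTVP specifically for industrial zero-shot tasks. RTVP has two major advantages. First, we design an expert-guided large-model domain adaptation mechanism and an industrial zero-shot method based on Mobile-SAM, which improves the generalization ability of large models in industrial scenarios. Second, RTVP automatically generates visual prompts directly from images and takes into account the text-visual prompt interactions ignored by previous LVLMs, improving the understanding of visual and textual content. RTVP achieves SOTA with 42.2% and 24.7% AP in the zero-shot and closed scenes of MMIO.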

Related papers