Fugu-MT Paper Translation (Abstract): IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

Paper overview: IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

arxiv url: http://arxiv.org/abs/2606.14383v2
Date: Tue, 16 Jun 2026 03:59:08 GMT
Status: Translation complete
Last updated in system: 2026-06-17 15:01:46.638771
Title: IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products
Title (reference translation): IndustryBench-MIPU: A Benchmark for Multi-Image Attribute Value Extraction in Industrial Products
Authors: Haonan Qi, Jin Cao, Yongqi Zhang, Xintong Wang, Weidong Tang, Bin Chen, Chengfu Huo, Haojun Pan, Hengyu You, Jing Li, Yingde Wang, Liang Ding
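Abstract summary: IndustryBench-MIPU is the first large-scale benchmark for multi-image industrial product understanding. It probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge, and cross-image evidence integration to assemble scattered specifications. The benchmark comprises 4,559 products across 27,652 images, with 103,703 annotations spanning 18 industrial categories.
Reference score (independently computed attention metric): 24.36543103640838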
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction -- recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86--94%) but the best recovers only 49.9% of product-level attributes; moving from single-image to multi-image extraction costs 15--34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. Dataset and code are publicly available.
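Abstract (reference translation): Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are spread over multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, the authors introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction, that is, recovering property-value pairs from product images. The task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, the domain knowledge needed to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images, with 103,703 annotations spanning 18 industrial categories, and was constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86-94%), but the best model recovers only 49.9% of product-level attributes, and moving from single-image to multi-image extraction costs 15-34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. The dataset and code are publicly available.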
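The gap between precision and product-level recall reported in the abstract can be illustrated with a minimal sketch. It is not the benchmark's evaluation code: the exact-match rule after normalization, the "first value seen wins" merge, and the names `normalize`, `merge_image_predictions`, and `precision_recall` are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

# One record maps attribute name -> value, e.g. {"Pressure rating": "PN16"}.
Pairs = Dict[str, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplistic matching rule)."""
    return " ".join(text.lower().split())


def merge_image_predictions(per_image: Iterable[Pairs]) -> Pairs:
    """Union per-image predictions into one product-level record.

    Assumption: when two images give the same attribute, the first value wins.
    """
    merged: Pairs = {}
    for pairs in per_image:
        for name, value in pairs.items():
            merged.setdefault(normalize(name), normalize(value))
    return merged


def precision_recall(predicted: Pairs, gold: Pairs) -> Tuple[float, float]:
    """Exact-match precision and recall over (attribute, value) pairs."""
    pred = {(normalize(k), normalize(v)) for k, v in predicted.items()}
    ref = {(normalize(k), normalize(v)) for k, v in gold.items()}
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall
```

In this toy setting, a model that extracts two correct pairs from a product whose gold record has four pairs scores a precision of 1.0 and a recall of 0.5: everything it reports is right, yet half of the product's specification is missing. This is the same pattern the abstract describes as a completeness gap.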


Related papers