Fugu-MT paper translation (abstract): FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

Paper overview: FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

arxiv url: http://arxiv.org/abs/2604.07413v1
Date: Wed, 08 Apr 2026 12:23:27 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:05.459727
Title: FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
Title (reference translation): FORGE: Multimodal Evaluation for Manufacturing Scenarios
Authors: Xiangru Jian, Hao Xu, Wei Pang, Xinjian Zhao, Chengyu Tao, Qixin Zhang, Xikun Zhang, Chao Zhang, Guanzhi Deng, Alex Xue, Juan Du, Tianshu Yu, Garth Tarr, Linqi Song, Qiuzhuang Sun, Dacheng Tao
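Abstract summary: The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to move from simple perception to autonomous execution. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. We first construct a high-quality dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics. We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps.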
Reference score (independently computed attention score): 58.34124792457706
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.
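Abstract (reference translation): The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to move from simple perception to autonomous execution, but current evaluations do not reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Contrary to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, which sets a clear direction for future research. Beyond evaluation, our structured annotations can also serve as a training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.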
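
The abstract reports the fine-tuning gain as a relative improvement, not as a difference in percentage points. The sketch below shows that arithmetic under the usual definition, (new - baseline) / baseline, which is an assumption here because the abstract does not spell out the formula. The function name and the accuracy values are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline: float, finetuned: float) -> float:
    """Return the relative improvement of `finetuned` over `baseline`, in percent.

    Assumes the common definition (finetuned - baseline) / baseline * 100;
    the paper's exact formula is not given in the abstract.
    """
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (finetuned - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical accuracies, chosen only to illustrate the arithmetic.
    baseline_acc = 0.25
    finetuned_acc = 0.40
    gain = relative_improvement(baseline_acc, finetuned_acc)
    print(f"{gain:.1f}% relative improvement")
```

With these made-up numbers, an absolute gain of 15 percentage points corresponds to a 60% relative improvement, which is why relative figures can look much larger than the underlying change in accuracy when the baseline is low.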
