Fugu-MT Paper Translation (Abstract): AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly

Paper overview: AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly

arxiv url: http://arxiv.org/abs/2604.08983v1
Date: Fri, 10 Apr 2026 05:43:39 GMT
Status: Translation complete
Last updated in system: 2026-04-13 17:57:53.704074
Title: AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly
Title (reference translation): AssemLM: A Spatial-Reasoning Multimodal Large Language Model for Robotic Assembly
Authors: Zhi Jing, Jinbin Qiao, Ouyang Lu, Jicong Ao, Shuang Qiu, Yu-Gang Jiang, Chenjia Bai
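Abstract summary: This paper proposes AssemLM, a spatial multimodal large language model tailored for robotic assembly. AssemLM integrates assembly manuals, point clouds, and textual instructions to reason about and predict task-critical 6D assembly poses. The model achieves state-of-the-art performance in 6D pose reasoning across diverse assembly scenarios.
Reference score (independently computed attention score): 45.963541758601274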
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Spatial reasoning is a fundamental capability for embodied intelligence, especially for fine-grained manipulation tasks such as robotic assembly. While recent vision-language models (VLMs) exhibit preliminary spatial awareness, they largely rely on coarse 2D perception and lack the ability to perform accurate reasoning over 3D geometry, which is crucial for precise assembly operations. To address this limitation, we propose AssemLM, a spatial multimodal large language model tailored for robotic assembly. AssemLM integrates assembly manuals, point clouds, and textual instructions to reason about and predict task-critical 6D assembly poses, enabling explicit geometric understanding throughout the assembly process. To effectively bridge raw 3D perception and high-level reasoning, we adopt a specialized point cloud encoder to capture fine-grained geometric and rotational features, which are then integrated into the multimodal language model to support accurate 3D spatial reasoning for assembly tasks. In addition, we construct AssemBench, a large-scale dataset and benchmark for assembly-oriented spatial reasoning, comprising over 900K multimodal samples with precise 6D pose annotations. AssemBench extends spatial reasoning evaluation beyond 2D and grounding tasks into full 3D geometric inference, filling a critical gap in existing embodied AI benchmarks. Extensive experiments demonstrate that AssemLM achieves state-of-the-art performance in 6D pose reasoning across diverse assembly scenarios. Furthermore, real-robot evaluations show that our model can support fine-grained and multi-step assembly execution in real-world settings, demonstrating its potential for robotic assembly applications.
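Abstract (reference translation): Spatial reasoning is a fundamental capability for embodied intelligence, and it matters most for fine-grained manipulation tasks such as robotic assembly. Recent vision-language models (VLMs) show preliminary spatial awareness, but they rely largely on coarse 2D perception and cannot reason accurately over 3D geometry, which precise assembly operations require. To address this limitation, the authors propose AssemLM, a spatial multimodal large language model tailored for robotic assembly. AssemLM integrates assembly manuals, point clouds, and textual instructions to reason about and predict task-critical 6D assembly poses, giving it explicit geometric understanding throughout the assembly process. To bridge raw 3D perception and high-level reasoning, the model uses a specialized point cloud encoder that captures fine-grained geometric and rotational features; these features are then integrated into the multimodal language model to support accurate 3D spatial reasoning for assembly tasks. The authors also construct AssemBench, a large-scale dataset and benchmark for assembly-oriented spatial reasoning that comprises over 900K multimodal samples with precise 6D pose annotations. AssemBench extends spatial reasoning evaluation beyond 2D and grounding tasks into full 3D geometric inference, filling a critical gap in existing embodied AI benchmarks. Extensive experiments show that AssemLM achieves state-of-the-art performance in 6D pose reasoning across diverse assembly scenarios. Real-robot evaluations further show that the model can support fine-grained, multi-step assembly execution in real-world settings, demonstrating its potential for robotic assembly applications.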
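
Note: The abstract refers to predicting 6D assembly poses, that is, a 3D rotation plus a 3D translation for each part. The sketch below shows one common way a predicted 6D pose could be compared against an annotated one: a geodesic rotation error and a Euclidean translation error. This is an illustration only. The abstract does not state which metrics AssemLM or AssemBench use, and the function names (`rot_z`, `rotation_error_deg`, `translation_error`), the example values, and the choice of metres as the unit are all assumptions made here.

```python
import math


def rot_z(theta):
    """Return a 3x3 rotation matrix for a rotation of theta radians about z."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_error_deg(r_pred, r_true):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(r_pred^T @ r_true) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_pred, t_true):
    """Euclidean distance between two 3D translations (same unit as inputs)."""
    return math.dist(t_pred, t_true)


# Hypothetical example: the annotated pose is the identity, and the
# prediction is off by 10 degrees about z and 5 mm in the xy-plane.
pose_true = (rot_z(0.0), (0.0, 0.0, 0.0))
pose_pred = (rot_z(math.radians(10.0)), (0.003, 0.004, 0.0))

rot_err = rotation_error_deg(pose_pred[0], pose_true[0])
trans_err = translation_error(pose_pred[1], pose_true[1])
print(f"rotation error: {rot_err:.2f} deg, translation error: {trans_err:.4f} m")
```

Reporting rotation and translation errors separately keeps the two units apart, which is convenient when an assembly task tolerates a few millimetres of offset but very little angular misalignment.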
