Fugu-MT Paper Translation (Overview): Brick-Composer: Using MLLMs for Assembly with Diverse Bricks

Paper overview: Brick-Composer: Using MLLMs for Assembly with Diverse Bricks

arxiv url: http://arxiv.org/abs/2606.05445v1
Date: Wed, 03 Jun 2026 21:08:06 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.411965
Title: Brick-Composer: Using MLLMs for Assembly with Diverse Bricks
Authors: Jiateng Liu, Bingxuan Li, Zhenhailong Wang, Rushi Wang, Kaiwen Hong, Cheng Qian, Jiayu Liu, Denghui Zhang, Katherine Driggs-Campbell, Manling Li, Heng Ji
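Abstract summary: The paper studies whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly. BC-Bench is the first benchmark for evaluating MLLMs on assembly with diverse bricks. Brick-Composer is a learning framework that equips MLLMs with assembly skills through three complementary signals.
Reference score (independently computed attention score): 64.5380622477211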
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable building blocks. As a first step toward this vision, we study whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly. We formulate brick assembly as a sequential decision-making problem, where each step involves two subtasks: brick selection, identifying the target brick from candidate components, and brick pose estimation, predicting where and how the selected brick should be placed. To support this study, we introduce BC-Bench (Brick Construction Benchmark), the first benchmark for evaluating MLLMs on assembly with diverse bricks. Experiments show that current state-of-the-art MLLMs remain far from reliable builders, struggling with fine-grained brick selection and failing at precise pose estimation. To bridge this gap, we propose Brick-Composer, a learning framework that equips MLLMs with assembly skills through three complementary signals: Human Design Sparks, which provide affordance-rich construction demonstrations; World Feedback, which grounds predicted actions in visual and physical consequences; and Synthetic Experience, which scales learning beyond existing object designs. Brick-Composer improves brick selection accuracy by over three times, substantially reduces pose estimation errors, and raises strict step-level assembly success from less than 1% to around 15%. After training, a Qwen-3-8B can correctly compose up to 42% of the steps for a complete object, suggesting that MLLMs can acquire assembly capabilities through targeted, physically grounded learning.
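The abstract's formulation (each step consists of brick selection plus brick pose estimation, scored by a strict step-level success) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or BC-Bench: the `Step` fields, the single-axis `yaw_deg` rotation, the tolerances `pos_tol` and `yaw_tol`, and the rule that a step counts as successful only when both subtasks are correct.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Step:
    """One assembly action (hypothetical representation, not the paper's)."""
    brick_id: str                          # which candidate brick is selected
    position: tuple[float, float, float]   # where the brick is placed
    yaw_deg: float                         # how the brick is rotated (one axis only)


def selection_correct(pred: Step, gold: Step) -> bool:
    """Brick selection: the predicted brick matches the target brick."""
    return pred.brick_id == gold.brick_id


def pose_correct(pred: Step, gold: Step,
                 pos_tol: float = 0.5, yaw_tol: float = 1.0) -> bool:
    """Pose estimation: position and rotation fall within assumed tolerances."""
    dist = math.dist(pred.position, gold.position)
    # Wrap the angular difference into [-180, 180) before taking its magnitude.
    yaw_err = abs((pred.yaw_deg - gold.yaw_deg + 180.0) % 360.0 - 180.0)
    return dist <= pos_tol and yaw_err <= yaw_tol


def strict_step_success(pred: Step, gold: Step) -> bool:
    """Assumed strict criterion: both subtasks must be correct."""
    return selection_correct(pred, gold) and pose_correct(pred, gold)


def step_success_rate(preds: list[Step], golds: list[Step]) -> float:
    """Fraction of steps in one object's sequence that pass the strict check."""
    if len(preds) != len(golds):
        raise ValueError("prediction and reference sequences differ in length")
    if not golds:
        return 0.0
    hits = sum(strict_step_success(p, g) for p, g in zip(preds, golds))
    return hits / len(golds)


# Toy data: the first step matches (a 360 degree yaw equals 0), the second is misplaced.
gold_steps = [Step("2x4", (0.0, 0.0, 0.0), 0.0), Step("1x2", (2.0, 0.0, 1.0), 90.0)]
pred_steps = [Step("2x4", (0.0, 0.0, 0.0), 360.0), Step("1x2", (5.0, 0.0, 1.0), 90.0)]
rate = step_success_rate(pred_steps, gold_steps)
```

Under this assumed criterion, a step with the right brick but a misplaced pose still fails, which is one plausible reading of why strict step-level success is much lower than selection accuracy alone.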


Related papers