Fugu-MT paper translation (abstract): BLM$_1$: A Boundless Large Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning

Paper abstract: BLM$_1$: A Boundless Large Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning

arxiv url: http://arxiv.org/abs/2510.24161v1
Date: Tue, 28 Oct 2025 07:58:39 GMT
Status: translation complete
Last updated in system: 2025-10-29 15:35:36.897649
Title: BLM$_1$: A Boundless Large Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning
Title (reference translation): BLM$_1$: A Boundless Large Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning
Authors: Wentao Tan, Bowen Wang, Heng Zhi, Chenyu Liu, Zhe Li, Jian Liu, Zengrong Lin, Yukun Dai, Yipeng Chen, Wenjie Yang, Enci Xie, Hao Xue, Baixu Ji, Chen Xu, Zhibin Wang, Tianshi Wang, Lei Zhu, Heng Tao Shen
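Abstract summary: Multimodal large language models (MLLMs) have advanced vision-language reasoning and are increasingly deployed in embodied agents. We introduce the Boundless Large Model (BLM$_1$), a multimodal spatial foundation model that supports robust cross-embodiment control.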
Reference score (independently computed attention score): 68.85121620506119
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) have advanced vision-language reasoning and are increasingly deployed in embodied agents. However, significant limitations remain: MLLMs generalize poorly across digital-physical spaces and embodiments; vision-language-action models (VLAs) produce low-level actions yet lack robust high-level embodied reasoning; and most embodied large language models (ELLMs) are constrained to digital space with poor generalization to the physical world. Thus, unified models that operate seamlessly across digital and physical spaces while generalizing across embodiments and tasks remain absent. We introduce the Boundless Large Model (BLM$_1$), a multimodal spatial foundation model that preserves instruction following and reasoning, incorporates embodied knowledge, and supports robust cross-embodiment control. BLM$_1$ integrates three key capabilities -- cross-space transfer, cross-task learning, and cross-embodiment generalization -- via a two-stage training paradigm. Stage I injects embodied knowledge into the MLLM through curated digital corpora while maintaining language competence. Stage II trains a policy module through an intent-bridging interface that extracts high-level semantics from the MLLM to guide control, without fine-tuning the MLLM backbone. This process is supported by a self-collected cross-embodiment demonstration suite spanning four robot embodiments and six progressively challenging tasks. Evaluations across digital and physical benchmarks show that a single BLM$_1$ instance outperforms four model families -- MLLMs, ELLMs, VLAs, and GMLMs -- achieving $\sim$6% gains in digital tasks and $\sim$3% in physical tasks.
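Abstract (reference translation): Multimodal large language models (MLLMs) have advanced vision-language reasoning and are increasingly deployed in embodied agents. However, MLLMs generalize poorly across digital-physical spaces and embodiments, and vision-language-action models (VLAs) produce low-level actions yet lack robust high-level embodied reasoning. Thus, unified models that operate seamlessly across digital and physical spaces while generalizing across embodiments and tasks remain absent. We introduce the Boundless Large Model (BLM$_1$), a multimodal spatial foundation model that preserves instruction following and reasoning, incorporates embodied knowledge, and supports robust cross-embodiment control. BLM$_1$ integrates three key capabilities (cross-space transfer, cross-task learning, and cross-embodiment generalization) via a two-stage training paradigm. Stage I injects embodied knowledge into the MLLM through curated digital corpora while maintaining language competence. Stage II trains a policy module through an intent-bridging interface that extracts high-level semantics from the MLLM to guide control, without fine-tuning the MLLM backbone. This process is supported by a self-collected cross-embodiment demonstration suite spanning four robot embodiments and six progressively challenging tasks. Evaluations across digital and physical benchmarks show that a single BLM$_1$ instance outperforms four model families (MLLMs, ELLMs, VLAs, and GMLMs), achieving gains of $\sim$6% in digital tasks and $\sim$3% in physical tasks.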

Related papers