Fugu-MT Paper Translation (Abstract): Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

Paper summary: Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

arxiv url: http://arxiv.org/abs/2605.15153v1
Date: Thu, 14 May 2026 17:50:42 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.996592
Title: Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action
Title (reference translation): Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination, and Action
Authors: Yi Zhang, Yinda Chen, Che Liu, Zeyuan Ding, Jin Xu, Shilong Zou, Junwei Liao, Jiayu Hu, Xiancong Ren, Xiaopeng Zhang, Yechi Liu, Haoyuan Shi, Zecong Tang, Haosong Sun, Renwen Cui, Kuishu Wu, Wenhai Liu, Yang Xu, Yingji Zhang, Yidong Wang, Senkang Hu, Jinpeng Lu, Nga Teng Chan, Yechen Wu, Yong Dai, Jian Tang, Xiaozhu Ju,
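Abstract summary: Pelican-Unified 1.0 is the first embodied foundation model trained according to the principle of unification. It uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. With a single checkpoint, Pelican-Unified 1.0 achieves strong performance across all three capabilities.
Reference score (independently computed attention measure): 35.968153930385434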
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present Pelican-Unified 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unified 1.0 uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a dense latent variable. A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, enabling the model to jointly optimize understanding, reasoning, imagination, and action during training, rather than training three isolated expert systems. Experiments demonstrate that unification does not imply compromise. With a single checkpoint, Pelican-Unified 1.0 achieves strong performance across all three capabilities: 64.7 on eight VLM benchmarks, the best among comparable-scale models; 66.03 on WorldArena, ranking first; and 93.5 on RoboTwin, the second-best average among compared action methods. These results show that the unified paradigm succeeds in preserving specialist strength while bringing understanding, reasoning, imagination, and action into one model.
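Abstract (reference translation): This paper introduces Pelican-Unified 1.0, the first embodied foundation model trained on the principle of unification. Pelican-Unified 1.0 uses a single VLM as a unified understanding module that maps scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also acts as a unified reasoning module: it autoregressively generates task-, action-, and future-oriented chains of thought in a single forward pass, and its final hidden state is projected into a dense latent variable. A Unified Future Generator (UFG) is conditioned on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, so the model optimizes understanding, reasoning, imagination, and action jointly during training instead of training three isolated expert systems. The experiments show that unification does not imply compromise. With a single checkpoint, Pelican-Unified 1.0 scores 64.7 on eight VLM benchmarks (the best among comparable-scale models), 66.03 on WorldArena (ranking first), and 93.5 on RoboTwin (the second-best average among the compared action methods). These results indicate that the unified paradigm preserves specialist strength while bringing understanding, reasoning, imagination, and action into one model.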
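
As a rough illustration of the data flow described in the abstract, the following Python sketch conditions a shared trunk on a latent vector, emits video and action predictions from two heads in one denoising step, and sums language, video, and action losses into a single objective. It is a toy under stated assumptions, not the authors' implementation: the class `ToyUnifiedFutureGenerator`, the helper names, all dimensions, the ReLU trunk, the MSE losses, and the equal loss weights are hypothetical and do not come from the paper.

```python
import random


def init_matrix(rows, cols, rng):
    """Create a small random weight matrix (rows x cols)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Multiply a weight matrix by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


class ToyUnifiedFutureGenerator:
    """Hypothetical stand-in for the UFG: one shared trunk, two output heads."""

    def __init__(self, latent_dim, video_dim, action_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Trunk input: latent + noisy video + noisy action + scalar noise level.
        in_dim = latent_dim + video_dim + action_dim + 1
        self.trunk = init_matrix(hidden_dim, in_dim, rng)
        self.video_head = init_matrix(video_dim, hidden_dim, rng)
        self.action_head = init_matrix(action_dim, hidden_dim, rng)

    def denoise_step(self, latent, noisy_video, noisy_action, t):
        """One step that predicts both modalities from shared features."""
        x = latent + noisy_video + noisy_action + [t]
        hidden = [max(0.0, v) for v in linear(self.trunk, x)]  # ReLU (assumed)
        return linear(self.video_head, hidden), linear(self.action_head, hidden)


def joint_loss(lang_loss, video_pred, video_target, action_pred, action_target,
               weights=(1.0, 1.0, 1.0)):
    """Weighted sum of language, video, and action losses (weights assumed)."""
    w_lang, w_video, w_action = weights
    return (w_lang * lang_loss
            + w_video * mse(video_pred, video_target)
            + w_action * mse(action_pred, action_target))


# Toy usage with made-up sizes and random inputs.
rng = random.Random(1)
latent = [rng.gauss(0, 1) for _ in range(4)]        # stands in for the VLM latent
noisy_video = [rng.gauss(0, 1) for _ in range(6)]   # flattened toy "video"
noisy_action = [rng.gauss(0, 1) for _ in range(3)]  # toy action vector
video_target = [0.0] * 6
action_target = [0.0] * 3

ufg = ToyUnifiedFutureGenerator(latent_dim=4, video_dim=6, action_dim=3, hidden_dim=8)
video_pred, action_pred = ufg.denoise_step(latent, noisy_video, noisy_action, t=0.5)
total = joint_loss(0.7, video_pred, video_target, action_pred, action_target)
print(len(video_pred), len(action_pred))
```

Because both heads read the same hidden features, a gradient of `total` with respect to the trunk would mix signals from all three loss terms, which is the joint-optimization idea the abstract describes. The sketch omits gradients, the VLM, and the actual denoising schedule.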

