Fugu-MT Paper Translation (Abstract): GigaBrain-0: A World Model-Powered Vision-Language-Action Model

Paper overview: GigaBrain-0: A World Model-Powered Vision-Language-Action Model

arXiv URL: http://arxiv.org/abs/2510.19430v1
Date: Wed, 22 Oct 2025 09:57:13 GMT
Status: Translation complete
System update date: 2025-10-25 03:08:15.541163
Title: GigaBrain-0: A World Model-Powered Vision-Language-Action Model
Title (reference translation): GigaBrain-0: A World Model-Driven Vision-Language-Action Model
Authors: GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, Peng Li, Qiuping Deng, Runqi Ouyang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yilong Li, Yiran Ding, Yuan Xu, Yun Ye, Yukun Zhou, Zhehao Dong, Zhenan Wang, Zhichao Liu, Zheng Zhu
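Abstract summary: We introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data. GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. We also present GigaBrain-0-Small, a lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.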
Reference score (independently computed attention score): 44.08074448490287
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.
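Abstract (reference translation): Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, and sim2real transfer data). By using world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data and improves cross-task generalization. Through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, the proposed method lets the model reason about spatial geometry, object states, and long-horizon dependencies during task execution. This substantially improves real-world performance on dexterous, long-horizon, and mobile manipulation tasks. GigaBrain-0 achieves superior generalization across variations in appearance (e.g., textures, colors), object placements, and camera viewpoints. We also present GigaBrain-0-Small, a lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.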
