Fugu-MT Paper Translation (Abstract): GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation

Paper abstract: GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation

arxiv url: http://arxiv.org/abs/2605.20752v2
Date: Thu, 28 May 2026 12:25:46 GMT
Status: Translation complete
Last updated in system: 2026-05-30 05:02:24.515501
Title: GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation
Title (reference translation): GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation
Authors: Zijian Zhang, Yuqing Jiang, Qian Cheng, Xiaofan Li, Si Liu, Ding Zhao, Ping Luo, Weitao Zhou, Haibao Yu
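Abstract summary: Vision-language-action (VLA) policies have advanced language-conditioned robotic manipulation by transferring semantic priors to action generation. Standard action-imitation learning often lacks sufficient modeling of explicit 3D spatial information, dense geometric supervision, and future environment evolution. We propose GaussianDream, a feed-forward 3D Gaussian world-model plug-in.
Reference score (attention metric computed by this site): 54.671815855499034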
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language-action (VLA) policies have advanced language-conditioned robotic manipulation by transferring semantic priors from pretrained vision-language models to action generation. However, standard action-imitation learning often lacks sufficient modeling of explicit 3D spatial information, dense geometric supervision, and future environment evolution, all critical for precise robotic interaction. To address this, we propose GaussianDream, a feed-forward 3D Gaussian world-model plug-in. Specifically, we introduce learnable GaussianDream Queries in the encoder, enabling the model to capture current-frame 3D spatial structure and short-horizon future evolution. During training, the latent GaussianDream prefix is processed by a static reconstruction head and a future prediction head to produce current 3D Gaussian scene states and future Gaussian evolution states. The current branch is supervised by RGB rendering and depth, while the future branch uses future RGB, depth, and pseudo 3D scene-flow signals. During inference, GaussianDream discards all auxiliary heads and retains only the learned prefix to condition action generation, without test-time Gaussian reconstruction or future prediction. Experimental results demonstrate that GaussianDream achieves state-of-the-art performance across multiple robotic manipulation benchmarks, reaching 98.4% on LIBERO, 54.8% on RoboCasa Human-50, and 50.0% on real-robot tasks. Compared with existing 3D-enhanced VLA methods, GaussianDream achieves strong accuracy while providing higher inference efficiency than video-based world-model approaches.
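Abstract (reference translation): Vision-language-action (VLA) policies have advanced language-conditioned robotic manipulation by transferring semantic priors from pretrained vision-language models to action generation. However, standard action-imitation learning often lacks modeling of explicit 3D spatial information, dense geometric supervision, and future environment evolution. To address this, we propose GaussianDream, a feed-forward 3D Gaussian world-model plug-in. Specifically, we introduce learnable GaussianDream Queries into the encoder so that the model captures the current 3D spatial structure and short-horizon future evolution. During training, the latent GaussianDream prefix is processed by a static reconstruction head and a future prediction head, producing current 3D Gaussian scene states and future Gaussian evolution states. The current branch is supervised by RGB rendering and depth, while the future branch uses future RGB, depth, and pseudo 3D scene-flow signals. During inference, GaussianDream discards all auxiliary heads and keeps only the learned prefix to condition action generation, with no test-time Gaussian reconstruction or future prediction. GaussianDream achieves state-of-the-art performance across multiple robotic manipulation benchmarks, reaching 98.4% on LIBERO, 54.8% on RoboCasa Human-50, and 50.0% on real-robot tasks. Compared with existing 3D-enhanced VLA methods, GaussianDream achieves strong accuracy while offering higher inference efficiency than video-based world-model approaches.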
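
The train/inference asymmetry described in the abstract (auxiliary heads used only during training, with only the learned prefix conditioning action generation at inference) can be illustrated with a toy sketch. This is not the authors' implementation: the class name `PrefixPolicySketch`, the single dot-product attention step, the mean pooling, the random weights, and all dimensions are assumptions made for illustration, and the auxiliary heads here emit plain vectors rather than rendered 3D Gaussians.

```python
import math
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def attend(queries, tokens):
    """Single-head dot-product attention: each query pools the tokens."""
    dim = len(tokens[0])
    pooled = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        pooled.append(
            [sum(e / total * t[j] for e, t in zip(exps, tokens)) for j in range(dim)]
        )
    return pooled


class PrefixPolicySketch:
    """Hypothetical toy model: query tokens -> latent prefix -> action."""

    def __init__(self, dim=4, num_queries=2, action_dim=3, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        self.queries = rand(num_queries, dim)    # stand-in for learnable queries
        self.action_head = rand(action_dim, dim)
        self.recon_head = rand(dim, dim)         # training-only: current scene
        self.future_head = rand(dim, dim)        # training-only: future evolution

    def forward(self, obs_tokens, training):
        prefix = attend(self.queries, obs_tokens)
        summary = [sum(col) / len(prefix) for col in zip(*prefix)]
        out = {"action": matvec(self.action_head, summary)}
        if training:
            # Auxiliary outputs exist only to receive supervision during training.
            out["current_scene"] = [matvec(self.recon_head, p) for p in prefix]
            out["future_scene"] = [matvec(self.future_head, p) for p in prefix]
        return out


model = PrefixPolicySketch()
obs = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [1.0, 0.0, -1.0, 0.5]]
train_out = model.forward(obs, training=True)
infer_out = model.forward(obs, training=False)
```

In this sketch the action depends only on the prefix, so skipping the two auxiliary heads at inference leaves the action unchanged while avoiding their cost.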


Related papers