Fugu-MT paper translation (abstract): AtomVLA: Scalable Post-Training for Robotic Manipulation via Predictive Latent World Models

Paper overview: AtomVLA: Scalable Post-Training for Robotic Manipulation via Predictive Latent World Models

arxiv url: http://arxiv.org/abs/2603.08519v1
Date: Mon, 09 Mar 2026 15:52:48 GMT
Status: translation complete
Last updated in system: 2026-03-23 08:17:42.11152
Title: AtomVLA: Scalable Post-Training for Robotic Manipulation via Predictive Latent World Models
Authors: Xiaoquan Sun, Zetian Xu, Chen Cao, Zonghe Liu, Yihan Sun, Jingrui Pang, Ruijian Zhang, Zhen Yang, Kang Pang, Dingxin He, Mingqi Yuan, Jiayu Chen
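Abstract summary: Vision-Language-Action (VLA) models show potential for generalizable robotic manipulation. Current paradigms rely on coarse, high-level task instructions during supervised fine-tuning. The authors propose AtomVLA, the first subtask-aware VLA framework integrated with a scalable offline post-training pipeline.
Reference score (attention score computed independently by the site): 9.608633915316252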
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. The execution of complex multi-step behaviors in VLA models can be improved by robust instruction grounding, a critical component for effective control. However, current paradigms predominantly rely on coarse, high-level task instructions during supervised fine-tuning. This instruction grounding gap leaves models without explicit intermediate guidance, leading to severe compounding errors in long-horizon tasks. Therefore, bridging this instruction gap and providing scalable post-training for VLA models is urgent. To tackle this problem, we propose AtomVLA, the first subtask-aware VLA framework integrated with a scalable offline post-training pipeline. Our framework leverages a large language model to decompose high-level demonstrations into fine-grained atomic subtasks. This approach utilizes a pretrained predictive world model to score candidate action chunks against subtask goals in the latent space, mitigating error accumulation while significantly improving long-horizon robustness. Furthermore, this approach enables highly efficient Group Relative Policy Optimization without the prohibitive expenses associated with online rollouts on physical robots. Extensive simulations validate that our AtomVLA maintains strong robustness under perturbations. When evaluated against fundamental baseline models, it achieves an average success rate of 97.0% on the LIBERO benchmark and 48.0% on the LIBERO-PRO benchmark. Finally, experiments conducted in the real world using the Galaxea R1 Lite platform confirm its broad applicability across diverse tasks, especially long-horizon tasks. All datasets, checkpoints, and code will be released to the public domain following the acceptance of this work for future research.
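The scoring step described in the abstract (ranking candidate action chunks against a subtask goal in latent space, then feeding the scores to Group Relative Policy Optimization) can be illustrated with a minimal sketch. The abstract does not give the reward definition, so everything here is an assumption for illustration only: the reward is taken to be the negative Euclidean distance between a predicted latent and the goal latent, the names `latent_reward` and `group_relative_advantages` are hypothetical, and the predicted latents are hard-coded toy values rather than outputs of a real world model. The group-relative step is the standard GRPO normalization (reward minus group mean, divided by group standard deviation).

```python
import math
from statistics import mean, pstdev


def latent_reward(predicted, goal):
    """Hypothetical reward: negative Euclidean distance between the latent
    predicted for an action chunk and the subtask-goal latent.
    (Assumed form; the abstract does not specify the scoring function.)"""
    return -math.dist(predicted, goal)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: center and scale rewards within one group
    of candidates sampled for the same state."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy data: three candidate action chunks, each already mapped to a
# predicted latent state by a (hypothetical) pretrained world model.
goal_latent = [1.0, 0.0]
predicted_latents = [[1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]]

rewards = [latent_reward(z, goal_latent) for z in predicted_latents]
advantages = group_relative_advantages(rewards)

for z, r, a in zip(predicted_latents, rewards, advantages):
    print(f"latent={z} reward={r:.3f} advantage={a:.3f}")
```

Because the advantages are computed only from scores inside one group, no learned value function and no rollout on a physical robot is needed in this sketch, which is consistent with the offline setting the abstract describes.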
