Fugu-MT Paper Translation (Abstract): RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation

Paper overview: RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation

arxiv url: http://arxiv.org/abs/2509.15212v1
Date: Thu, 18 Sep 2025 17:58:02 GMT
Status: Translation complete
Last updated in system: 2025-09-19 17:26:53.390065
Title: RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
Authors: Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li
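Abstract summary: RynnVLA-001 is a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, effectively bridging visual frame prediction with action prediction.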
Reference score (independently computed attention metric): 39.383510768790295
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
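Illustration: the abstract describes ActionVAE only at a high level, so the following is a minimal sketch of the general idea (a variational autoencoder that maps a chunk of actions to a small latent vector), not the paper's architecture. The class name TinyActionVAE, the linear encoder and decoder, the chunk length of 8, the 7-dimensional action, the latent size of 4, and the beta weight are all assumptions chosen for illustration. The sketch runs a single forward pass and computes the usual reconstruction-plus-KL loss; training is omitted.

```python
import math
import random


def init_linear(rng, out_dim, in_dim, scale=0.1):
    """Return (weights, bias) for a randomly initialised linear layer."""
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, weights, bias):
    """Compute y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class TinyActionVAE:
    """Hypothetical linear VAE over flattened action chunks (illustrative only)."""

    def __init__(self, chunk_len, action_dim, latent_dim, seed=0):
        self.rng = random.Random(seed)
        self.in_dim = chunk_len * action_dim
        self.enc_mu = init_linear(self.rng, latent_dim, self.in_dim)
        self.enc_logvar = init_linear(self.rng, latent_dim, self.in_dim)
        self.dec = init_linear(self.rng, self.in_dim, latent_dim)

    @staticmethod
    def flatten(chunk):
        """Turn a list of per-step action vectors into one flat vector."""
        return [a for step in chunk for a in step]

    def encode(self, chunk):
        """Map an action chunk to the mean and log-variance of q(z | actions)."""
        flat = self.flatten(chunk)
        return linear(flat, *self.enc_mu), linear(flat, *self.enc_logvar)

    def reparameterize(self, mu, logvar):
        """Sample z = mu + sigma * eps with eps drawn from N(0, 1)."""
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, logvar)]

    def decode(self, z):
        """Reconstruct the flattened action chunk from the latent vector."""
        return linear(z, *self.dec)

    def loss(self, chunk, beta=1.0):
        """Squared reconstruction error plus beta-weighted KL to N(0, I)."""
        flat = self.flatten(chunk)
        mu, logvar = self.encode(chunk)
        z = self.reparameterize(mu, logvar)
        recon = self.decode(z)
        recon_err = sum((r - a) ** 2 for r, a in zip(recon, flat))
        kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))
        return recon_err + beta * kl, z


# Made-up chunk: 8 steps of a 7-dimensional action (for example, pose deltas plus gripper).
chunk = [[0.1 * t, -0.05 * t, 0.0, 0.0, 0.0, 0.0, 1.0] for t in range(8)]
vae = TinyActionVAE(chunk_len=8, action_dim=7, latent_dim=4)
total, z = vae.loss(chunk)
print(len(z), len(vae.decode(z)))
```

In this sketch a downstream policy would predict the 4-dimensional latent instead of all 56 raw action values, which is one way to read the abstract's claim about reducing the complexity of the VLA output space.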

Related papers