Fugu-MT Paper Translation (Summary): Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos

Paper summary: Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos

arxiv url: http://arxiv.org/abs/2510.21571v1
Date: Fri, 24 Oct 2025 15:39:31 GMT
Status: Translation complete
Last updated in system: 2025-10-28 09:00:15.523063
Title: Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos
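Title (reference translation): Pretraining Scalable Vision-Language-Action Models for Robotic Manipulation Using Real-Life Human Activity Videos
Authors: Qixiu Li, Yu Deng, Yaobo Liang, Lin Luo, Lei Zhou, Chengtang Yao, Lingqi Zeng, Zhiyuan Feng, Huizhi Liang, Sicheng Xu, Yizhong Zhang, Xi Chen, Hao Chen, Lily Sun, Dong Chen, Jiaolong Yang, Baining Guo
Abstract summary: We develop a fully automated, holistic human activity analysis approach for arbitrary human hand videos. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. We design a dexterous hand VLA model architecture and pretrain the model on this dataset.
Reference score (independently computed attention score): 42.86535655563404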
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper presents a novel approach for pretraining robotic manipulation Vision-Language-Action (VLA) models using a large corpus of unscripted real-life video recordings of human hand activities. Treating human hand as dexterous robot end-effector, we show that "in-the-wild" egocentric human videos without any annotations can be transformed into data formats fully aligned with existing robotic V-L-A training data in terms of task granularity and labels. This is achieved by the development of a fully-automated holistic human activity analysis approach for arbitrary human hand videos. This approach can generate atomic-level hand activity segments and their language descriptions, each accompanied with framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data covers a wide range of objects and concepts, dexterous manipulation tasks, and environment variations in real life, vastly exceeding the coverage of existing robot data. We design a dexterous hand VLA model architecture and pretrain the model on this dataset. The model exhibits strong zero-shot capabilities on completely unseen real-world observations. Additionally, fine-tuning it on a small amount of real robot action data significantly improves task success rates and generalization to novel objects in real robotic experiments. We also demonstrate the appealing scaling behavior of the model's task performance with respect to pretraining data scale. We believe this work lays a solid foundation for scalable VLA pretraining, advancing robots toward truly generalizable embodied intelligence.
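Abstract (reference translation): This paper proposes a new approach for pretraining Vision-Language-Action (VLA) models for robotic manipulation using a large corpus of unscripted, real-life video recordings of human hand activities. By treating the human hand as a dexterous robot end-effector, we show that unannotated "in-the-wild" egocentric human videos can be converted into data formats fully aligned with existing robotic V-L-A training data in terms of task granularity and labels. This is achieved by developing a fully automated, holistic human activity analysis approach for arbitrary human hand videos. The approach generates atomic-level hand activity segments and their language descriptions, each accompanied by framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data covers a wide range of objects and concepts, dexterous manipulation tasks, and real-life environment variations, far exceeding the coverage of existing robot data. We design a dexterous hand VLA model architecture and pretrain the model on this dataset. The model shows strong zero-shot capabilities on completely unseen real-world observations. Furthermore, fine-tuning it on a small amount of real robot action data significantly improves task success rates and generalization to novel objects in real robotic experiments. We also demonstrate favorable scaling behavior of the model's task performance with respect to pretraining data scale. We believe this work lays a solid foundation for scalable VLA pretraining, advancing robots toward truly generalizable embodied intelligence.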
