Fugu-MT Paper Translation (Abstract): Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Paper overview: Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

arxiv url: http://arxiv.org/abs/2606.17846v2
Date: Wed, 17 Jun 2026 17:06:39 GMT
Status: Translation complete
System update date: 2026-06-18 17:16:50.783229
Title: Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models
Title (reference translation): Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models
Authors: Haoqi Yuan, Zhixuan Liang, Anzhe Chen, Ye Wang, Haoyang Li, Pei Lin, Yiyang Huang, Zixing Lei, Tong Zhang, Jiazhao Zhang, Jie Zhang, Jingyang Fan, Gengze Zhou, Qihang Peng, Chenxu Lv, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Chenfei Wu, Xiong-Hui Chen
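Abstract summary: This report presents Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms.
Reference score (attention measure computed independently by this site): 95.75234389806654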
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including $\pi_{0.5}$, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.
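Abstract (reference translation): Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. This report examines whether that scaling recipe can be applied to robotic manipulation to achieve genuine generalization. Doing so is difficult because, unlike text, manipulation data is inherently heterogeneous, costly to collect, and narrow in diversity, which makes alignment and scale hard to achieve at the same time. The report presents Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, so that large-scale multi-source training is coherent rather than conflicting. This alignment capability in turn lets Qwen-RobotManip absorb manipulation data at a scale that earlier training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos, with no proprietary data collection, Qwen-RobotManip builds a pretraining corpus of about 38,100 hours and shows emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. Because standard benchmarks fail to capture pretraining quality, the report instead adopts OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including $\pi_{0.5}$, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.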
