Fugu-MT Paper Translation (Abstract): Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Paper overview: Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

arxiv url: http://arxiv.org/abs/2606.17030v2
Date: Tue, 16 Jun 2026 16:55:52 GMT
Status: Translation complete
Last updated in system: 2026-06-17 15:01:46.835746
Title: Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation
Title (reference translation): Qwen-RobotWorld Technical Report: Unified Embodied World Modeling through Language-Conditioned Video Generation
Authors: Jie Zhang, Xiaoyue Chen, Anzhe Chen, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu
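Abstract summary: We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. It predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer.
Reference score (independently computed attention score): 80.92703471330982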
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench, and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.
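Abstract (reference translation): We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. Using natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation offers three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, in which a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. The model ranks 1st overall on EWMBench and DreamGen Bench, and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.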
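
Illustrative note: the abstract describes a double-stream transformer that couples language/vision semantics with video latents through layer-wise joint attention. The following is a minimal sketch of that general idea in plain Python, not the authors' implementation. It assumes a single attention head and omits normalization, MLPs, timestep conditioning, and positional encodings; the names StreamWeights and joint_attention_block, the toy dimensions, and the random weights are all hypothetical and chosen only for illustration.

```python
import math
import random


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def add(x, y):
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


class StreamWeights:
    """Hypothetical per-stream projections: each stream keeps its own weights."""

    def __init__(self, dim, rng):
        self.wq = random_matrix(dim, dim, rng)
        self.wk = random_matrix(dim, dim, rng)
        self.wv = random_matrix(dim, dim, rng)
        self.wo = random_matrix(dim, dim, rng)


def joint_attention_block(text_tokens, video_tokens, text_w, video_w):
    """One layer: separate projections per stream, one shared attention."""
    n_text = len(text_tokens)
    # Project each stream with its own weights, then concatenate along
    # the sequence axis (list "+" appends the video rows after the text rows).
    q = matmul(text_tokens, text_w.wq) + matmul(video_tokens, video_w.wq)
    k = matmul(text_tokens, text_w.wk) + matmul(video_tokens, video_w.wk)
    v = matmul(text_tokens, text_w.wv) + matmul(video_tokens, video_w.wv)
    # Scaled dot-product attention over the joint sequence, so every token
    # can attend to tokens from both streams.
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    attn = [softmax([s * scale for s in row]) for row in scores]
    mixed = matmul(attn, v)
    # Split back into the two streams and apply per-stream output projections
    # with a residual connection.
    text_out = add(text_tokens, matmul(mixed[:n_text], text_w.wo))
    video_out = add(video_tokens, matmul(mixed[n_text:], video_w.wo))
    return text_out, video_out


# Toy usage: 3 text tokens and 5 video-latent tokens of width 8, two layers.
rng = random.Random(0)
layers = [(StreamWeights(8, rng), StreamWeights(8, rng)) for _ in range(2)]
text = random_matrix(3, 8, rng)
video = random_matrix(5, 8, rng)
for text_w, video_w in layers:
    text, video = joint_attention_block(text, video, text_w, video_w)
print(len(text), len(video), len(video[0]))
```

The point of the sketch is the coupling pattern: the two streams never share projection weights, yet they exchange information at every layer because attention runs over the concatenated sequence. Whether the reported model keeps the text stream's features fixed across layers is not stated in the abstract, so the sketch makes no claim on that point.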

Related papers