Fugu-MT Paper Translation (Abstract): Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Paper abstract: Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

arxiv url: http://arxiv.org/abs/2605.30280v1
Date: Thu, 28 May 2026 17:36:31 GMT
Status: Translation complete
System update date: 2026-05-30 02:45:56.634521
Title: Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Title (reference translation): Qwen-VLA: Unifying vision-language-action modeling across tasks, environments, and robot embodiments
Authors: Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wang, Yiyang Huang, Zixing Lei, Wujian Peng, Delin Chen, Yingming Zheng, Jingyang Fan, Xianwei Zhuang, Xin Zhou, Haoyang Li, Anzhe Chen, Tong Zhang, Xuejing Liu, Yuchong Sun, Ruizhe Chen, Zhaohai Li, Chenxu Lü, Zhibo Yang, Tao Yu, Xionghui Chen,
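Abstract summary: Embodied intelligence is often studied through specialized models for individual tasks such as manipulation and navigation. This paper presents Qwen-VLA, a unified foundation model that extends Qwen's vision-language modeling stack to continuous action and trajectory generation.
Reference score (independently computed attention score): 96.23886784364997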
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.
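Abstract (reference translation): Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, which results in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. This work examines whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. It presents Qwen-VLA, a unified foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robot manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, the authors introduce embodiment-aware prompt conditioning, in which robot-specific textual descriptions specify the current embodiment and control convention. Manipulation, navigation, and trajectory prediction are further cast into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.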
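
Note: the abstract describes embodiment-aware prompt conditioning only at a high level: a robot-specific textual description states the current embodiment and control convention. The sketch below is one possible reading of that idea, not the authors' implementation. The `Embodiment` class, the `build_prompt` function, the prompt wording, and the example robot values are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of one robot platform (not from the paper)."""
    name: str        # platform name, e.g. "WidowX"
    action_dim: int  # number of continuous action dimensions
    control: str     # free-text control convention


def build_prompt(embodiment: Embodiment, instruction: str) -> str:
    """Prepend a textual embodiment description to the task instruction.

    The layout of the prompt is an assumption made for illustration; the
    abstract does not specify the actual template.
    """
    header = (
        f"Robot: {embodiment.name}. "
        f"Action dimensions: {embodiment.action_dim}. "
        f"Control convention: {embodiment.control}."
    )
    return f"{header}\nTask: {instruction.strip()}"


# Illustrative values only; the descriptions actually used are not given.
widowx = Embodiment("WidowX", 7, "end-effector delta pose plus gripper")
aloha = Embodiment("ALOHA", 14, "absolute joint positions for two arms")

if __name__ == "__main__":
    # The same instruction yields a different conditioning text per robot.
    print(build_prompt(widowx, "pick up the spoon"))
    print(build_prompt(aloha, "pick up the spoon"))
```

Under this reading, the same task instruction produces a different conditioning text for each platform, which is how a single model could be told which action space and control convention to emit.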

Related papers