Fugu-MT Paper Translation (Summary): CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

Paper summary: CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

arxiv url: http://arxiv.org/abs/2605.10426v2
Date: Wed, 13 May 2026 08:01:33 GMT
Status: Translation complete
Last updated in system: 2026-05-14 17:13:58.857906
Title: CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
Title (reference translation): CoWorld-VLA: Thinking About a Multi-Expert World Model for Autonomous Driving
Authors: Minqing Huang, Yujiao Xiang, Zihan Liang, Jiajie Huang, Jingqi Wang, Zhi Xu, Feiyang Tan, Hangning Zhou, Mu Yang, Gong Che
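Abstract summary: CoWorld-VLA is a multi-expert world reasoning framework for autonomous driving. World representations serve as explicit conditions that guide action planning. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning.
Reference score (independently computed attention score): 4.4380564455353975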
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/AFARI-Research/CoWorld-VLA.
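Abstract (reference translation): Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) cannot preserve continuous spatiotemporal structure, and latent world reasoning is difficult to use as a direct condition for action generation. This paper proposes CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, providing planner-accessible conditioning signals. Specifically, it constructs four types of tokens (semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens), which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA uses a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA performs strongly in collision avoidance and trajectory accuracy, in both future scene generation and planning on the NAVSIM v1 benchmark. Ablation studies further validate the complementarity of the expert tokens and their effectiveness as planning conditions for action generation. Code is available at https://github.com/AFARI-Research/CoWorld-VLA.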

Related papers