Fugu-MT Paper Translation (Abstract): MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

Paper overview: MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

arxiv url: http://arxiv.org/abs/2605.12624v2
Date: Thu, 14 May 2026 17:59:31 GMT
Status: Translation complete
Last updated in system: 2026-05-15 18:18:46.745738
Title: MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
Authors: Yuzhou Huang, Benjin Zhu, Hengtong Lu, Victor Shea-Jay Huang, Haiming Zhang, Wei Chen, Jifeng Dai, Yan Xie, Hongsheng Li
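Abstract summary: We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces AR language tokens and flow-matching continuous action trajectories in a single forward pass over one shared representation. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time.
Reference score (independently computed attention score): 54.57163800903507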
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built -- as isolated subtask improvements that fail to compose coherent driving capabilities -- rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces AR language tokens (optional) and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of each modality. A full streaming design processes the driving video framewise rather than as fixed video-action chunks under costly temporal VLM modeling. Planned trajectories evolve smoothly across frames while a learned streaming memory channel carries temporal context and updates. The unified architecture enables fast/slow systems on dense and sparse MoT backbones via flexible self-attention context management, and exposes a measurable language-control path for action: language-predicted driving intents steer the action diffusion via classifier-free guidance (CFG), turning language-side intent into control signals for continuous action planning. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time (8.20 RFS vs. 8.13 GT RFS) with 2 diffusion steps, achieves state-of-the-art planning ADEs over prior VA/VLA by large margins, and matches VA latency (16 FPS vs. RAP's 18 FPS at 1B scale) while preserving natural language interfaces for human-vehicle interaction.
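The abstract's idea of steering a flow-matching action sampler with classifier-free guidance (CFG) in 2 steps can be illustrated with a toy sketch. This is not the paper's implementation: `toy_velocity`, `sample_with_cfg`, `guidance_scale`, and the waypoint arrays are all hypothetical names and values chosen for illustration, and the learned velocity network is replaced by a closed-form field that points straight at a fixed target trajectory.

```python
import numpy as np

def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity field.

    It points straight from the current sample x toward `target`,
    scaled so that integrating from t to 1 lands on `target`.
    """
    return (target - x) / (1.0 - t)

def sample_with_cfg(x0, cond_target, uncond_target,
                    guidance_scale=2.0, num_steps=2):
    """Euler integration of a CFG-combined velocity from t=0 to t=1.

    cond_target / uncond_target play the role of what the model would
    predict with and without the language intent (assumed, not learned).
    """
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so (1 - t) is never zero
        v_cond = toy_velocity(x, t, cond_target)
        v_uncond = toy_velocity(x, t, uncond_target)
        # Classifier-free guidance: extrapolate from the unconditional
        # velocity toward the intent-conditioned one.
        v = v_uncond + guidance_scale * (v_cond - v_uncond)
        x = x + dt * v
    return x

# Three 2D waypoints: "go straight" (no intent) vs. "turn left" (with intent).
straight = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
turn_left = np.array([[1.0, 0.2], [2.0, 0.8], [3.0, 1.8]])
noise = np.random.default_rng(0).normal(size=straight.shape)

trajectory = sample_with_cfg(noise, turn_left, straight, guidance_scale=1.0)
```

In this toy setting a guidance scale of 1 reproduces the intent-conditioned trajectory, 0 ignores the intent, and values above 1 exaggerate the difference between the two. A real model would replace `toy_velocity` with a network conditioned on the shared VLM representation.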

Related papers