Fugu-MT Paper Translation (Summary): MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

Paper summary: MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

arxiv url: http://arxiv.org/abs/2605.12624v1
Date: Tue, 12 May 2026 18:09:42 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:27.61182
Title: MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
Title (reference translation): MindVLA-U1: VLA Beats VA with a Unified Streaming Architecture for Autonomous Driving
Authors: Yuzhou Huang, Benjin Zhu, Hengtong Lu, Victor Shea-Jay Huang, Haiming Zhang, Wei Chen, Jifeng Dai, Yan Xie, Hongsheng Li
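Abstract summary: We propose MindVLA-U1, the first unified streaming Vision-Language-Action architecture for autonomous driving. A unified VLM backbone produces autoregressive language tokens and flow-matching continuous action trajectories in a single forward pass over one shared representation. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time.
Reference score (independently computed attention score): 54.57163800903507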
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built -- as isolated subtask improvements that fail to compose into coherent driving capabilities -- rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces autoregressive language tokens and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of each modality. A streaming design processes the driving video framewise rather than as fixed video-action chunks, while a learned memory channel carries temporal context across frames so planned trajectories evolve smoothly without redundant multi-frame VLM modeling. The unified architecture admits fast/slow execution on dense/sparse Mixture-of-Transformers (MoT) backbones via flexible self-attention context management, and exposes a measurable language-to-action route: a language-predicted driving intent steers action diffusion through classifier-free guidance (CFG), turning language-side intent into a control signal for continuous trajectory generation. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time (8.20 RFS vs. 8.13 GT RFS) with 2 diffusion steps, achieves state-of-the-art planning ADEs over prior VA/VLA methods by large margins, and matches VA-class throughput (16 FPS vs. RAP-DINO's 18 FPS) while preserving natural-language interfaces.
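Abstract (reference translation): Autonomous driving has moved from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of that path beyond Vision-to-Action (VA). In practice, driving VLAs have often lagged behind VA in planning quality, which suggests that the difficulty lies not simply in model scale but in the interface that combines semantic reasoning, temporal context, and continuous control. We argue that this gap reflects how VLA has been built (as isolated subtask improvements that do not compose into coherent driving capabilities) rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces autoregressive language tokens and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of each modality. The streaming design processes the driving video frame by frame rather than as fixed video-action chunks, while a learned memory channel carries temporal context across frames, so planned trajectories evolve smoothly without redundant multi-frame VLM modeling. The unified architecture supports fast/slow execution on dense/sparse Mixture-of-Transformers (MoT) backbones through flexible self-attention context management, and it exposes a measurable language-to-action route: a language-predicted driving intent steers action diffusion through classifier-free guidance (CFG), turning language-side intent into a control signal for continuous trajectory generation. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time (8.20 RFS vs. 8.13 GT RFS) with 2 diffusion steps, achieves state-of-the-art planning ADEs over prior VA/VLA methods by large margins, and matches VA-class throughput (16 FPS vs. RAP-DINO's 18 FPS) while preserving natural-language interfaces.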
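
The abstract describes a language-predicted intent steering flow-matching trajectory generation through classifier-free guidance, sampled in 2 steps. The following is a minimal sketch of that general idea, not the paper's implementation: `toy_velocity` is a hypothetical stand-in for a learned velocity head, the trajectory is a flat list of coordinates, and the guidance rule is the standard CFG combination `v_uncond + scale * (v_cond - v_uncond)`, which the abstract does not spell out.

```python
def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity head.

    Returns the straight-line flow-matching velocity that carries x at
    time t to `target` at time 1. A real model would predict this from
    image, memory, and intent features instead of being told the target.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def cfg_velocity(x, t, cond_target, uncond_target, scale):
    """Standard classifier-free guidance applied to velocities (assumed form)."""
    v_cond = toy_velocity(x, t, cond_target)      # intent-conditioned branch
    v_uncond = toy_velocity(x, t, uncond_target)  # intent-dropped branch
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def sample_trajectory(x0, cond_target, uncond_target, scale=1.0, steps=2):
    """Euler integration of the guided velocity field from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays below 1, so the division in toy_velocity is safe
        v = cfg_velocity(x, t, cond_target, uncond_target, scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise = [0.3, -1.2, 0.5, 0.9]          # initial sample (fixed for repeatability)
    with_intent = [1.0, 0.0, 2.0, 0.5]     # toy trajectory under the intent
    without_intent = [1.0, 0.0, 2.0, 0.0]  # toy trajectory with intent dropped
    print(sample_trajectory(noise, with_intent, without_intent, scale=1.0))
    print(sample_trajectory(noise, with_intent, without_intent, scale=2.0))
```

In this toy setting a guidance scale of 1 recovers the intent-conditioned trajectory, and larger scales push the result further along the direction in which the conditioned and unconditioned branches differ. That is the sense in which language-side intent can act as a control signal.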

Related papers