Fugu-MT paper translation (abstract): AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving

Paper summary: AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving

arxiv url: http://arxiv.org/abs/2603.14851v1
Date: Mon, 16 Mar 2026 05:50:31 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:36.076689
Title: AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving
Title (reference translation): AutoMoT: A Unified Vision-Language-Action Model Using Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving
Authors: Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao, Collister Chua, Zhan Chen, Long Chen, Chen Lv
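Abstract summary: AutoMoT is an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. AutoMoT achieves competitive performance compared to state-of-the-art methods.
Reference score (independently computed attention score): 36.82081211127408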
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformers (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pretrained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pretrained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pretrained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. Demonstration videos and qualitative results are available on the project page (https://automot-website.github.io/).
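Abstract (reference translation): Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems is expected to improve scene understanding. However, existing integration strategies have several limitations: they struggle to resolve the distribution misalignment between the reasoning and action spaces, underuse the general reasoning capabilities of pretrained VLMs, or incur considerable inference latency when generating the action policy. To address these challenges, we propose AutoMoT, an end-to-end AD framework that unifies reasoning and action generation in a single vision-language-action (VLA) model. The proposed method uses a mixture-of-transformers (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pretrained VLMs while enabling fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, in both open-loop and closed-loop settings, show that AutoMoT achieves competitive performance compared with state-of-the-art methods. We further investigate the functional boundary of pretrained VLMs in AD and examine when AD-tailored fine-tuning is necessary. The results indicate that pretrained VLMs can achieve competitive multi-task scene understanding through semantic prompting alone, whereas fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. Demonstration videos and qualitative results are available on the project page (https://automot-website.github.io/).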
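
Note: The abstract mentions fast-slow inference through execution at different task frequencies. The Python sketch below is only a toy illustration of that general idea, not the authors' implementation: a slow "reasoning" step refreshes a cached context every few ticks, while a fast "action" step runs on every tick and reuses the most recent cache. The class name `FastSlowRunner`, the `slow_period` parameter, and the string placeholders are all hypothetical, and the two rates are simulated in a single thread rather than run concurrently.

```python
from dataclasses import dataclass, field


@dataclass
class FastSlowRunner:
    """Toy fast-slow loop (hypothetical, not the paper's code)."""

    slow_period: int = 5       # assumed number of fast ticks per slow update
    cached_context: str = ""   # stands in for cached reasoning features
    context_tick: int = -1     # tick at which the cache was last refreshed
    log: list = field(default_factory=list)

    def slow_reason(self, observation: str, tick: int) -> None:
        # Placeholder for an expensive reasoning pass; it only updates the cache.
        self.cached_context = f"context({observation})"
        self.context_tick = tick

    def fast_act(self, observation: str) -> str:
        # Placeholder for a cheap action step that reuses the cached context.
        return f"action({observation} | {self.cached_context})"

    def step(self, observation: str, tick: int) -> str:
        # The slow path runs only on every `slow_period`-th tick.
        if tick % self.slow_period == 0:
            self.slow_reason(observation, tick)
        # The fast path runs on every tick, possibly with a stale context.
        action = self.fast_act(observation)
        self.log.append((tick, self.context_tick, action))
        return action


runner = FastSlowRunner(slow_period=3)
actions = [runner.step(f"obs{t}", t) for t in range(7)]
```

In this toy run the action at tick 4 is computed from the context produced at tick 3, which shows the trade-off such a scheme implies: the action rate is decoupled from the reasoning rate at the cost of acting on slightly stale context.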

Related papers