Fugu-MT Paper Translation (Abstract): Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

Paper summary: Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

arxiv url: http://arxiv.org/abs/2512.16760v2
Date: Sun, 04 Jan 2026 12:37:43 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:40.410568
Title: Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future
Authors: Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, Xiaoshuai Hao, Linfeng Li, Hang Song, Xiangtai Li, Jun Ma, Shaojie Shen, Jianke Zhu, Dacheng Tao, Ziwei Liu, Junwei Liang
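Abstract summary: Vision-Language-Action (VLA) frameworks integrate perception with language-grounded decision making. They offer a pathway toward more interpretable, generalizable, and human-aligned driving policies. This work aims to establish a coherent foundation for advancing human-compatible autonomous driving systems.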
Reference score (independently calculated attention score): 125.92052530850425
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Autonomous driving has long relied on modular "Perception-Decision-Action" pipelines, where hand-crafted interfaces and rule-based components often break down in complex or long-tailed scenarios. Their cascaded design further propagates perception errors, degrading downstream planning and control. Vision-Action (VA) models address some limitations by learning direct mappings from visual inputs to actions, but they remain opaque, sensitive to distribution shifts, and lack structured reasoning or instruction-following capabilities. Recent progress in Large Language Models (LLMs) and multimodal learning has motivated the emergence of Vision-Language-Action (VLA) frameworks, which integrate perception with language-grounded decision making. By unifying visual understanding, linguistic reasoning, and actionable outputs, VLAs offer a pathway toward more interpretable, generalizable, and human-aligned driving policies. This work provides a structured characterization of the emerging VLA landscape for autonomous driving. We trace the evolution from early VA approaches to modern VLA frameworks and organize existing methods into two principal paradigms: End-to-End VLA, which integrates perception, reasoning, and planning within a single model, and Dual-System VLA, which separates slow deliberation (via VLMs) from fast, safety-critical execution (via planners). Within these paradigms, we further distinguish subclasses such as textual vs. numerical action generators and explicit vs. implicit guidance mechanisms. We also summarize representative datasets and benchmarks for evaluating VLA-based driving systems and highlight key challenges and open directions, including robustness, interpretability, and instruction fidelity. Overall, this work aims to establish a coherent foundation for advancing human-compatible autonomous driving systems.
