Fugu-MT Paper Translation (Abstract): SteerVLA: Steering Vision-Language-Action Models in Long-Tail Driving Scenarios

Paper overview: SteerVLA: Steering Vision-Language-Action Models in Long-Tail Driving Scenarios

arxiv url: http://arxiv.org/abs/2602.08440v2
Date: Fri, 13 Feb 2026 08:14:33 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:41.278353
Title: SteerVLA: Steering Vision-Language-Action Models in Long-Tail Driving Scenarios
Authors: Tian Gao, Celine Tan, Catherine Glossop, Timothy Gao, Jiankai Sun, Kyle Stachowicz, Shirley Wu, Oier Mees, Dorsa Sadigh, Sergey Levine, Chelsea Finn
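Abstract summary: A fundamental challenge in autonomous driving is integrating high-level semantic reasoning for long-tail events with low-level, reactive control for robust driving. The paper proposes SteerVLA, which generates fine-grained language instructions that steer a vision-language-action driving policy. SteerVLA is evaluated on a challenging closed-loop benchmark, where it outperforms state-of-the-art methods by 4.77 points in overall driving score and by 8.04 points on a long-tail subset.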
Reference score (independently computed attention score): 104.10555123175055
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A fundamental challenge in autonomous driving is the integration of high-level, semantic reasoning for long-tail events with low-level, reactive control for robust driving. While large vision-language models (VLMs) trained on web-scale data offer powerful common-sense reasoning, they lack the grounded experience necessary for safe vehicle control. We posit that an effective autonomous agent should leverage the world knowledge of VLMs to guide a steerable driving policy toward robust control in driving scenarios. To this end, we propose SteerVLA, which leverages the reasoning capabilities of VLMs to produce fine-grained language instructions that steer a vision-language-action (VLA) driving policy. Key to our method is this rich language interface between the high-level VLM and low-level VLA, which allows the high-level policy to more effectively ground its reasoning in the control outputs of the low-level policy. To provide fine-grained language supervision aligned with vehicle control, we leverage a VLM to augment existing driving data with detailed language annotations, which we find to be essential for effective reasoning and steerability. We evaluate SteerVLA on a challenging closed-loop benchmark, where it outperforms state-of-the-art methods by 4.77 points in overall driving score and by 8.04 points on a long-tail subset. The project website is available at: https://steervla.github.io/.
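The language interface described in the abstract can be illustrated with a toy sketch: a high-level module emits a fine-grained instruction string, and a low-level module maps the observation plus that instruction to a control command. Everything below is a hypothetical illustration, not the authors' implementation: the names `Observation`, `Control`, `high_level_instruction`, `low_level_control`, and `drive_step`, the text-based scene input, and all numeric values are assumptions, and simple keyword rules stand in for the VLM and the VLA policy.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Observation:
    # Hypothetical, heavily simplified stand-in for camera input and ego state.
    scene_description: str
    speed_mps: float


@dataclass(frozen=True)
class Control:
    steer: float     # range [-1, 1], negative means left
    throttle: float  # range [0, 1]
    brake: float     # range [0, 1]


def high_level_instruction(obs: Observation) -> str:
    """Toy stand-in for the high-level VLM: scene -> language instruction."""
    scene = obs.scene_description.lower()
    if "pedestrian" in scene:
        return "slow down and yield to the pedestrian"
    if "blocked lane" in scene:
        return "change to the left lane to pass the obstacle"
    return "keep lane and maintain speed"


def low_level_control(obs: Observation, instruction: str) -> Control:
    """Toy stand-in for the VLA policy: (observation, instruction) -> control."""
    if "slow down" in instruction:
        # Brake harder at higher speed (assumed threshold, for illustration only).
        brake = 0.6 if obs.speed_mps > 5.0 else 0.3
        return Control(steer=0.0, throttle=0.0, brake=brake)
    if "left lane" in instruction:
        return Control(steer=-0.3, throttle=0.3, brake=0.0)
    return Control(steer=0.0, throttle=0.4, brake=0.0)


def drive_step(obs: Observation) -> Tuple[str, Control]:
    """One step of the two-level loop: reason in language, then act."""
    instruction = high_level_instruction(obs)
    return instruction, low_level_control(obs, instruction)


if __name__ == "__main__":
    instruction, control = drive_step(Observation("Pedestrian crossing ahead", 8.0))
    print(instruction)
    print(control)
```

The point of the sketch is only the shape of the interface: the instruction string is the sole channel between the two levels, so the low-level policy can be steered by changing the text without changing its inputs otherwise.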
