Fugu-MT paper translation (abstract): BLUE: Toward Better Language Use in Efficient Vision-Language-Action Models for Autonomous Driving

Paper overview: BLUE: Toward Better Language Use in Efficient Vision-Language-Action Models for Autonomous Driving

arxiv url: http://arxiv.org/abs/2606.08684v1
Date: Sun, 07 Jun 2026 15:37:16 GMT
Status: translation complete
Last updated in system: 2026-06-09 14:42:06.389367
Title: BLUE: Toward Better Language Use in Efficient Vision-Language-Action Models for Autonomous Driving
Authors: George Ling, Lijin Yang, Hao Yang, Zhongzhan Huang,
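Abstract summary: The paper proposes BLUE, a minimal method for better language use in vision-language-action (VLA) models for autonomous driving (AD). Language matters on only a small fraction of routes, but on those routes it can greatly improve or degrade performance. BLUE trains a lightweight gate on frozen VLA hidden states to decide, per frame, whether to activate language generation or predict actions directly.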
Reference score (independently computed attention score): 18.22380132296319
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present BLUE, a minimal method for better language use in vision-language-action (VLA) models for autonomous driving (AD). Through extensive analysis, we reveal that language matters on only a small fraction of routes, but on those routes it can greatly improve or degrade performance. Generating language at every frame is therefore inefficient, since most computation is spent on frames that do not benefit from language. We further show that pretrained VLA hidden states potentially already encode whether language will benefit a given frame, even though scene complexity and kinematic features alone struggle to predict this. Based on this finding, BLUE trains a lightweight gate on frozen VLA hidden states to decide per frame whether to activate language generation or predict actions directly, without modifying the backbone or requiring additional human annotation. With just a 0.11M-parameter gate, BLUE sets a new state of the art on both benchmarks, achieving 76.2% success rate on Bench2Drive and 36 driving score on Longest6 v2, while delivering 2.54x inference speedup and 8.9% success rate improvement over the backbone. BLUE provides a practical path toward efficient language-augmented AD, showing that VLA models can retain the benefits of language at a fraction of the cost. Our code, data, logs and checkpoints are fully available on https://github.com/George-Ling3/BLUE.
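The per-frame gating described in the abstract can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the two-layer MLP shape, the names `init_gate`, `gate_probability` and `choose_path`, the 0.5 threshold, and the dimensions 896 and 128, which were picked only so that the parameter count lands near the 0.11M mentioned in the abstract. The abstract does not specify the gate architecture, and the weights here are random rather than trained, so this is not the authors' implementation.

```python
import math
import random


def init_gate(hidden_dim, gate_dim, seed=0):
    """Create random weights for a hypothetical two-layer MLP gate."""
    rng = random.Random(seed)
    s1 = 1.0 / math.sqrt(hidden_dim)
    s2 = 1.0 / math.sqrt(gate_dim)
    return {
        "w1": [[rng.gauss(0.0, s1) for _ in range(hidden_dim)]
               for _ in range(gate_dim)],
        "b1": [0.0] * gate_dim,
        "w2": [rng.gauss(0.0, s2) for _ in range(gate_dim)],
        "b2": 0.0,
    }


def count_parameters(gate):
    """Count every weight and bias in the gate."""
    n_w1 = sum(len(row) for row in gate["w1"])
    return n_w1 + len(gate["b1"]) + len(gate["w2"]) + 1


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate_probability(gate, hidden_state):
    """Score one frame: probability that language generation should run.

    hidden_state stands in for a frozen backbone hidden state; the backbone
    itself is never modified by this function.
    """
    activations = [
        max(0.0, sum(w * h for w, h in zip(row, hidden_state)) + b)  # ReLU
        for row, b in zip(gate["w1"], gate["b1"])
    ]
    logit = sum(w * a for w, a in zip(gate["w2"], activations)) + gate["b2"]
    return sigmoid(logit)


def choose_path(gate, hidden_state, threshold=0.5):
    """Return "language" to run language generation, else "direct"."""
    if gate_probability(gate, hidden_state) >= threshold:
        return "language"
    return "direct"


if __name__ == "__main__":
    gate = init_gate(hidden_dim=896, gate_dim=128)
    rng = random.Random(1)
    frame_hidden = [rng.gauss(0.0, 1.0) for _ in range(896)]  # fake frame
    print(count_parameters(gate), choose_path(gate, frame_hidden))
```

In a setup like this, only the gate's weights would be trained while the backbone stays frozen, and frames routed to "direct" skip language generation, which is where an inference speedup would come from.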

Related papers