Fugu-MT Paper Translation (Abstract): Listen, Look, Drive: Coupling Audio Instructions for User-aware VLA-based Autonomous Driving

Paper overview: Listen, Look, Drive: Coupling Audio Instructions for User-aware VLA-based Autonomous Driving

arxiv url: http://arxiv.org/abs/2601.12142v3
Date: Thu, 29 Jan 2026 09:11:16 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:40.819331
Title: Listen, Look, Drive: Coupling Audio Instructions for User-aware VLA-based Autonomous Driving
Authors: Ziang Guo, Feng Yang, Xuefeng Zhang, Jiaqi Guo, Kun Zhao, Yixiao Zhou, Peng Lu, Sifa Zheng, Zufeng Zhang
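Abstract summary: This paper introduces EchoVLA, a user-aware VLA that couples camera streams with in situ audio instructions. An audio-augmented dataset is synthesized with different emotion types paired with corresponding driving behaviors. In open-loop benchmarks, the approach reduces the average L2 error by 59.4% and the collision rate by 74.4% relative to a vision-only baseline.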
Reference score (independently computed attention score): 16.800050024086953
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Vision Language Action (VLA) models promise an open-vocabulary interface that can translate perceptual ambiguity into semantically grounded driving decisions, yet they still treat language as a static prior fixed at inference time. As a result, the model must infer continuously shifting objectives from pixels alone, yielding delayed or overly conservative maneuvers. We argue that effective VLAs for autonomous driving need an online channel in which users can influence driving with specific intentions. To this end, we present EchoVLA, a user-aware VLA that couples camera streams with in situ audio instructions. We augment the nuScenes dataset with temporally aligned, intent-specific speech commands generated by converting ego-motion descriptions into synthetic audios. Further, we compose emotional speech-trajectory pairs into a multimodal Chain-of-Thought (CoT) for fine-tuning a Multimodal Large Model (MLM) based on Qwen2.5-Omni. Specifically, we synthesize the audio-augmented dataset with different emotion types paired with corresponding driving behaviors, leveraging the emotional cues embedded in tone, pitch, and speech tempo to reflect varying user states, such as urgent or hesitant intentions, thus enabling our EchoVLA to interpret not only the semantic content but also the emotional context of audio commands for more nuanced and emotionally adaptive driving behavior. In open-loop benchmarks, our approach reduces the average L2 error by $59.4\%$ and the collision rate by $74.4\%$ compared to the baseline of vision-only perception. More experiments on nuScenes dataset validate that EchoVLA not only steers the trajectory through audio instructions, but also modulates driving behavior in response to the emotions detected in the user's speech.

List of related papers