Fugu-MT Paper Translation (Abstract): Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning

Paper summary: Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning

arxiv url: http://arxiv.org/abs/2603.09512v1
Date: Tue, 10 Mar 2026 11:12:28 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:24.248253
Title: Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning
Title (reference translation): Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning
Authors: Chun-Peng Chang, Chen-Yu Wang, Holger Caesar, Alain Pagani
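Abstract summary: The paper examines whether Vision-Language Models (VLMs), applied as driving assistants, can understand how present observations shape future outcomes. Models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning. The authors propose a simple yet effective self-supervised tuning approach with chain-of-thought reasoning that improves both consistency and temporal reasoning without requiring temporal labels.
Reference score (independently computed attention score): 17.08518699175473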
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A reliable driving assistant should provide consistent responses based on temporally grounded reasoning derived from observed information. In this work, we investigate whether Vision-Language Models (VLMs), when applied as driving assistants, can respond consistently and understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporally grounded reasoning. While recent efforts have integrated VLMs into autonomous driving, prior studies typically emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables consistent future reasoning and thus ensures reliable decision-making, a claim we critically examine. We focus on two major challenges limiting VLM reliability in this setting: response inconsistency, where minor input perturbations yield different answers or, in some cases, responses degenerate toward near-random guessing, and limited temporal reasoning, in which models fail to reason and align sequential events from current observations, often resulting in incorrect or even contradictory responses. Moreover, we find that models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning, indicating a tendency to over-rely on pretrained patterns rather than modeling temporal dynamics. To address these issues, we adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark dataset specifically designed to assess future scene reasoning. In addition, we propose a simple yet effective self-supervised tuning approach with chain-of-thought reasoning that improves both consistency and temporal reasoning without requiring temporal labels.
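Abstract (reference translation): A reliable driving assistant should provide consistent responses based on temporally grounded reasoning derived from observed information. In this work, we investigate whether Vision-Language Models (VLMs) applied as driving assistants understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporal grounding. Although recent efforts have integrated VLMs into autonomous driving, prior studies emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables future reasoning and ensures reliable decision-making. We focus on two major challenges that limit VLM reliability in this setting: response inconsistency, where minor input perturbations yield different answers or, in some cases, responses degenerate toward near-random guessing; and limited temporal reasoning, where models fail to reason about and align sequential events from current observations, often producing incorrect or contradictory responses. Moreover, on tasks requiring temporal reasoning, the results suggest that models with strong visual understanding tend to over-rely on pretrained patterns rather than modeling temporal dynamics. To address these issues, we adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark dataset specifically designed to assess future scene reasoning. In addition, we propose a simple yet effective self-supervised tuning approach with chain-of-thought reasoning that improves both consistency and temporal reasoning without requiring temporal labels.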
