Fugu-MT Paper Translation (Abstract): Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving

Paper overview: Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving

arxiv url: http://arxiv.org/abs/2603.06054v1
Date: Fri, 06 Mar 2026 09:07:57 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.477251
Title: Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving
Authors: Nikos Theodoridis, Reenu Mohandas, Ganesh Sistu, Anthony Scanlan, Ciarán Eising, Tim Brophy
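Abstract summary: Vision-Language Models (VLMs) are used in automated driving applications. These models often fail on simple visual questions that are highly relevant to automated driving. The paper shows that concepts such as the presence of an object or agent in a scene are explicitly and linearly encoded. Other spatial visual concepts, such as the orientation of an object or agent, are implicitly encoded by the spatial structure retained by the vision encoder. Finally, it shows that increasing the distance of the object in question quickly degrades the linear separability of the corresponding visual concept.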
Reference score (independently computed attention metric): 3.333320380836246
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The use of Vision-Language Models (VLMs) in automated driving applications is becoming increasingly common, with the aim of leveraging their reasoning and generalisation capabilities to handle long tail scenarios. However, these models often fail on simple visual questions that are highly relevant to automated driving, and the reasons behind these failures remain poorly understood. In this work, we examine the intermediate activations of VLMs and assess the extent to which specific visual concepts are linearly encoded, with the goal of identifying bottlenecks in the flow of visual information. Specifically, we create counterfactual image sets that differ only in a targeted visual concept and then train linear probes to distinguish between them using the activations of four state-of-the-art (SOTA) VLMs. Our results show that concepts such as the presence of an object or agent in a scene are explicitly and linearly encoded, whereas other spatial visual concepts, such as the orientation of an object or agent, are only implicitly encoded by the spatial structure retained by the vision encoder. In parallel, we observe that in certain cases, even when a concept is linearly encoded in the model's activations, the model still fails to answer correctly. This leads us to identify two failure modes. The first is perceptual failure, where the visual information required to answer a question is not linearly encoded in the model's activations. The second is cognitive failure, where the visual information is present but the model fails to align it correctly with language semantics. Finally, we show that increasing the distance of the object in question quickly degrades the linear separability of the corresponding visual concept. Overall, our findings improve our understanding of failure cases in VLMs on simple visual tasks that are highly relevant to automated driving.
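To make the probing idea in the abstract concrete, here is a minimal sketch of a linear probe, written in plain Python. It is not the authors' code. Everything in it is an assumption made for illustration: the activations are synthetic Gaussian vectors rather than real VLM activations, the two classes stand in for a counterfactual image pair (concept present or absent), and the names `make_synthetic_activations`, `train_linear_probe`, `probe_accuracy` and the `shift` parameter are hypothetical. Held-out probe accuracy is used here as a simple proxy for how linearly separable the concept is.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_activations(n_per_class, dim, shift, rng):
    # ASSUMPTION: synthetic stand-in for VLM activations. The two classes are
    # Gaussian clouds pushed apart by `shift` along one random unit direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction))
    direction = [d / norm for d in direction]
    data = []
    for label, sign in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) + sign * shift * d for d in direction]
            data.append((x, label))
    rng.shuffle(data)
    return data


def train_linear_probe(data, lr=0.1, epochs=200):
    # Logistic regression fitted with full-batch gradient descent.
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, data):
    # Fraction of samples on the correct side of the probe's hyperplane.
    hits = 0
    for x, y in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        hits += int(pred == y)
    return hits / len(data)


def probe_accuracy(shift, seed=0, n_per_class=200, dim=8):
    # Train on 75% of the samples and report accuracy on the remaining 25%.
    rng = random.Random(seed)
    data = make_synthetic_activations(n_per_class, dim, shift, rng)
    split = int(0.75 * len(data))
    w, b = train_linear_probe(data[:split])
    return accuracy(w, b, data[split:])


if __name__ == "__main__":
    for shift in (3.0, 1.0, 0.0):
        print(f"shift={shift}: held-out accuracy={probe_accuracy(shift):.3f}")
```

In a real setting the synthetic vectors would be replaced by activations extracted from a chosen layer of the model for each image in the counterfactual set. A library classifier would likely replace the hand-written gradient descent, which is used here only to keep the sketch dependency-free.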

Related papers