Fugu-MT paper translation (abstract): Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Paper overview: Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

arxiv url: http://arxiv.org/abs/2605.30557v1
Date: Thu, 28 May 2026 20:44:47 GMT
Status: Translation complete
Last updated in system: 2026-06-01 20:56:50.22599
Title: Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
Title (reference translation): Do VLMs Know When (and Why) Not to Answer Spatial Questions?
Authors: Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton, Idan Szpektor, Mohit Bansal
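Abstract summary: Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. The authors introduce two types of observation challenges: occlusion, which hides target information, and perspective ambiguity, which produces misleading visual cues. For each configuration, they design spatial questions that are answerable under clean observations but require abstention under the introduced challenges.
Reference score (independently computed attention measure): 72.2500547961037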
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.
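Abstract (reference translation): Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of the 3D world: occlusion can make objects invisible, and perspective can make geometric properties misleading. Nevertheless, existing spatial reasoning benchmarks assume that observations are sufficient and reliable, and focus on whether models produce correct answers rather than on whether they recognize when a question cannot be answered and what additional observations would be needed. This work challenges that assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, the authors design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. They also evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are overconfident, attempting to solve spatial reasoning tasks even when the visual evidence is incomplete or misleading. Second, even when additional views are available, some models perform near random chance in identifying which view would provide reliable evidence. The findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.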
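
The evaluation design described in the abstract (questions answerable under clean observations but requiring abstention under occlusion or perspective ambiguity) could be scored roughly as in the sketch below. This is only an illustration under stated assumptions, not the paper's code: the `Item` record, the `ABSTAIN` label, the condition names, and the exact-match scoring rule are all hypothetical, and model outputs are assumed to be normalized already.

```python
from dataclasses import dataclass

# Hypothetical label for "this question cannot be answered from the image".
ABSTAIN = "cannot determine"


@dataclass
class Item:
    condition: str   # assumed names: "clean", "occlusion", "perspective"
    gold: str        # gold answer; ABSTAIN for challenged observations
    prediction: str  # model output, assumed already normalized


def accuracy_by_condition(items):
    """Return exact-match accuracy for each observation condition."""
    totals, correct = {}, {}
    for it in items:
        totals[it.condition] = totals.get(it.condition, 0) + 1
        correct[it.condition] = correct.get(it.condition, 0) + int(
            it.prediction == it.gold
        )
    return {cond: correct[cond] / totals[cond] for cond in totals}


# Toy data, invented for illustration only (not results from the paper).
items = [
    Item("clean", "left", "left"),
    Item("occlusion", ABSTAIN, "left"),      # overconfident answer
    Item("occlusion", ABSTAIN, ABSTAIN),     # correct abstention
    Item("perspective", ABSTAIN, "taller"),  # misled by perspective
]
scores = accuracy_by_condition(items)
```

Under this scoring, a model that always answers scores zero on every challenged condition, which is the overconfidence failure mode the abstract describes.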

Related papers