Fugu-MT Paper Translation (Abstract): CVSBench: A Comprehensive Benchmark for Cross-view Spatial Reasoning and Dreaming

Paper overview: CVSBench: A Comprehensive Benchmark for Cross-view Spatial Reasoning and Dreaming

arxiv url: http://arxiv.org/abs/2606.22476v1
Date: Sun, 21 Jun 2026 12:35:43 GMT
Status: Translation complete
Last updated in system: 2026-06-25 18:05:38.27687
Title: CVSBench: A Comprehensive Benchmark for Cross-view Spatial Reasoning and Dreaming
Title (reference translation): CVSBench: A Comprehensive Benchmark for Cross-view Spatial Reasoning and Dreaming
Authors: Ruixun Liu, Lingyu Zhang, Lanxuan Xue, Kaiyu Li, Bowen Fu, Xiangyong Cao
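Abstract summary: CVSBench is a large-scale benchmark for evaluating cross-view spatial reasoning through satellite-street pairs. The benchmark supports multiple tasks, including cross-view VQA, cross-view grounding, and viewpoint identification. Language-only reasoning yields only marginal improvements, while incorporating visual spatial imagination via a 3D scene imagination pipeline substantially improves cross-view reasoning.
Reference score (independently computed attention score): 13.534076118011603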
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Humans can effortlessly reason about scenes across different viewpoints, yet it remains unclear whether Vision-Language Models (VLMs) possess similar cross-view spatial abilities. Satellite-street scene pairs, with their complex contexts and extreme viewpoint variations, provide an ideal testbed. Motivated by this, we introduce CVSBench, a large-scale benchmark for evaluating cross-view spatial reasoning through satellite-street pairs. This benchmark supports multiple tasks, including cross-view VQA, cross-view grounding, and viewpoint identification. CVSBench comprises 3,297 cross-view image groups with 9,468 object-level annotations and 40,679 question-answer (QA) pairs, enabling systematic and controlled evaluation of cross-view spatial reasoning. Extensive evaluations reveal that advanced VLMs struggle to maintain object-level and layout consistency under drastic viewpoint changes. To bridge this gap towards human-like spatial cognition, we investigate two categories of approaches: spatially grounded reasoning and the incorporation of cognitive map inputs. Our findings demonstrate that language-only reasoning yields marginal improvements, while incorporating visual spatial imagination via a 3D scene imagination pipeline substantially improves cross-view reasoning. These results highlight the necessity of explicit visual-spatial representations for robust spatial cognition in VLMs. Our data and code are released at https://huggingface.co/datasets/zlyzlyzly/CVSBench.
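Abstract (reference translation): Humans can reason effortlessly about scenes seen from different viewpoints, but it is unclear whether Vision-Language Models (VLMs) have comparable cross-view spatial abilities. Satellite-street scene pairs, with their complex contexts and extreme viewpoint variations, provide an ideal testbed. We therefore introduce CVSBench, a large-scale benchmark for evaluating cross-view spatial reasoning through satellite-street pairs. The benchmark supports multiple tasks, including cross-view VQA, cross-view grounding, and viewpoint identification. CVSBench consists of 3,297 cross-view image groups with 9,468 object-level annotations and 40,679 QA pairs, enabling systematic and controlled evaluation of cross-view spatial reasoning. Extensive evaluations show that advanced VLMs struggle to maintain object-level and layout consistency under drastic viewpoint changes. To bridge this gap towards human-like spatial cognition, we investigate two categories of approaches: spatially grounded reasoning and the incorporation of cognitive map inputs. We show that language-only reasoning yields marginal improvements, while incorporating visual spatial imagination via a 3D scene imagination pipeline substantially improves cross-view reasoning. These results highlight the necessity of explicit visual-spatial representations for robust spatial cognition in VLMs. Our data and code are released at https://huggingface.co/datasets/zlyzlyzly/CVSBench.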
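To give a rough sense of the annotation density implied by the counts in the abstract, the following minimal sketch derives per-group averages. The three counts come from the abstract; the function name `density_stats` and the dictionary keys are hypothetical names chosen for illustration, and the averages say nothing about how annotations or questions are actually distributed across groups or tasks.

```python
# Counts as stated in the abstract.
IMAGE_GROUPS = 3_297
OBJECT_ANNOTATIONS = 9_468
QA_PAIRS = 40_679


def density_stats(groups: int, annotations: int, qa_pairs: int) -> dict:
    """Return simple average densities derived from the benchmark counts.

    These are plain ratios of the totals, not per-task statistics.
    """
    return {
        "annotations_per_group": annotations / groups,
        "qa_per_group": qa_pairs / groups,
        "qa_per_annotation": qa_pairs / annotations,
    }


stats = density_stats(IMAGE_GROUPS, OBJECT_ANNOTATIONS, QA_PAIRS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

On average this works out to roughly 2.9 object annotations and 12.3 QA pairs per image group, or about 4.3 QA pairs per annotated object.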