Fugu-MT Paper Translation (Abstract): Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Paper summary: Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

arxiv url: http://arxiv.org/abs/2605.30231v1
Date: Thu, 28 May 2026 17:00:52 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:56.566015
Title: Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
Title (reference translation): Beyond 3D VQA: Injecting 3D Spatial Priors into Vision-Language Models for Geometric Reasoning
Authors: Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma, Joseph Tighe, Fanyi Xiao
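Abstract summary: Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. GASP (Geometric-Aware Spatial Priors) is a framework that injects fundamental geometric priors directly into the LLM's transformer layers. GASP uses a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective that leverages ground-truth geometry from large-scale video scenes.
Reference score (independently computed attention level): 24.251128353444713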
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
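Abstract (reference translation): Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Common methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit to dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. This paper argues that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. GASP (Geometric-Aware Spatial Priors) is a framework that injects these priors directly into the LLM's transformer layers. GASP uses a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective that leverages ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. The analysis first provides a diagnostic showing that the internal correspondence matching accuracy of standard VLMs is very low (often below 5%). It then demonstrates that the training substantially improves this behavior, raising peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness, while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks, including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. The findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.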
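
Illustrative sketch (not from the paper): the abstract mentions a contrastive loss on ground-truth point correspondences and a correspondence matching accuracy, but gives no formulas. The Python below shows one common way such quantities are computed, assuming an InfoNCE-style loss over cosine similarities and a nearest-neighbor accuracy. The function names, the temperature of 0.07, and the toy feature vectors are all hypothetical choices for illustration, and GASP's actual objective may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(feats_a, feats_b):
    """Pairwise cosine similarities: rows index view A, columns index view B."""
    return [[cosine_similarity(fa, fb) for fb in feats_b] for fa in feats_a]


def info_nce_loss(feats_a, feats_b, temperature=0.07):
    """Mean InfoNCE loss, assuming feats_a[i] and feats_b[i] are a true pair.

    Assumption: this is a generic contrastive loss, not the paper's exact form.
    """
    sims = similarity_matrix(feats_a, feats_b)
    total = 0.0
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        total += log_sum_exp - logits[i]  # negative log-softmax of the true match
    return total / len(sims)


def matching_accuracy(feats_a, feats_b):
    """Fraction of points in view A whose nearest neighbor in view B is the true match."""
    sims = similarity_matrix(feats_a, feats_b)
    correct = 0
    for i, row in enumerate(sims):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            correct += 1
    return correct / len(sims)


# Toy features for three points seen in two views (hypothetical values).
view_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
view_b_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# Same features, but stored in the wrong order, so index i no longer matches.
view_b_shuffled = [view_b_aligned[1], view_b_aligned[2], view_b_aligned[0]]

if __name__ == "__main__":
    print("aligned accuracy:", matching_accuracy(view_a, view_b_aligned))
    print("shuffled accuracy:", matching_accuracy(view_a, view_b_shuffled))
    print("aligned loss:", info_nce_loss(view_a, view_b_aligned))
    print("shuffled loss:", info_nce_loss(view_a, view_b_shuffled))
```

In this toy setup the aligned features give a lower loss and a higher matching accuracy than the shuffled ones, which is the behavior a correspondence objective of this kind rewards.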

Related papers