Fugu-MT Paper Translation (Abstract): Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning

Paper overview: Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning

arxiv url: http://arxiv.org/abs/2602.21186v1
Date: Tue, 24 Feb 2026 18:37:34 GMT
Status: Translation complete
Last updated in system: 2026-02-25 17:34:53.884873
Title: Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
Title (reference translation): Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
Authors: Haoyi Jiang, Liu Liu, Xinjie Wang, Yonghao He, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang
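Abstract summary: Spatial intelligence can emerge from 2D vision alone, rather than being imposed through explicit spatial instruction tuning. This paper introduces Spa3R, a self-supervised framework that learns a unified spatial representation directly from unposed multi-view images. In experiments, Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods.
Reference score (independently computed attention score): 43.746951848993035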
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space--a cornerstone of spatial intelligence--remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. In this paper, we argue that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. To this end, we introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R is built upon the proposed Predictive Spatial Field Modeling (PSFM) paradigm, where Spa3R learns to synthesize feature fields for arbitrary unseen views conditioned on a compact latent representation, thereby internalizing a holistic and coherent understanding of the underlying 3D scene. We further integrate the pre-trained Spa3R Encoder into existing VLMs via a lightweight adapter to form Spa3-VLM, effectively grounding language reasoning in a global spatial context. Experiments on the challenging VSI-Bench demonstrate that Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods. These results highlight PSFM as a scalable path toward advancing spatial intelligence. Code is available at https://github.com/hustvl/Spa3R.
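Abstract (reference translation): Vision-Language Models (VLMs) show strong 2D visual understanding, but their ability to understand and reason about 3D space, a cornerstone of spatial intelligence, remains superficial. Current methods try to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. This paper argues that spatial intelligence can emerge from 2D vision alone, rather than being imposed through explicit spatial instruction tuning. To that end, it introduces Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R builds on the Predictive Spatial Field Modeling (PSFM) paradigm: by synthesizing feature fields for arbitrary unseen views conditioned on a compact latent representation, it internalizes a holistic and coherent understanding of the underlying 3D scene. The pre-trained Spa3R Encoder is integrated into existing VLMs through a lightweight adapter to form Spa3-VLM, grounding language reasoning in a global spatial context. Experiments on the challenging VSI-Bench show that Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods. These results highlight PSFM as a scalable path toward advancing spatial intelligence. Code is available at https://github.com/hustvl/Spa3R.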
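
Illustrative sketch (not from the paper): the abstract describes learning to predict features of unseen views from a compact latent built from other views. The toy below shows only that general training signal in plain Python. Everything in it is an assumption made for illustration: the mean-pooling "encoder", the linear "decoder", the view-query vector, and all function names are hypothetical and are not the Spa3R architecture or the API of the released code.

```python
# Toy predictive-view objective. All names and design choices here are
# hypothetical; this is NOT the Spa3R model, only the general idea of
# "encode context views -> compact latent -> predict held-out view features".

def mean_pool(view_features):
    """Toy encoder (assumption): average per-view feature vectors into one latent."""
    dim = len(view_features[0])
    count = len(view_features)
    return [sum(v[i] for v in view_features) / count for i in range(dim)]


def decode(latent, view_query, weights):
    """Toy decoder (assumption): linear map of [latent ; view_query] to features."""
    x = latent + view_query  # list concatenation
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(pred, target):
    """Mean squared error between predicted and actual target-view features."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def predictive_view_loss(context_features, target_query, target_features, weights):
    """Loss for predicting a held-out view's features from the context latent."""
    latent = mean_pool(context_features)
    return mse(decode(latent, target_query, weights), target_features)


def sgd_step(context_features, target_query, target_features, weights, lr=0.1):
    """One gradient-descent step on the decoder weights for the loss above."""
    latent = mean_pool(context_features)
    x = latent + target_query
    pred = decode(latent, target_query, weights)
    scale = 2.0 / len(pred)  # derivative of the mean of squared errors
    return [
        [w - lr * scale * (p - t) * xi for w, xi in zip(row, x)]
        for row, p, t in zip(weights, pred, target_features)
    ]


if __name__ == "__main__":
    context = [[1.0, 0.0], [0.0, 1.0]]   # features of two observed views
    query = [1.0]                        # stand-in for an unseen view's query
    target = [1.0, -1.0]                 # features of the held-out view
    weights = [[0.0] * 3 for _ in range(2)]
    for _ in range(5):
        print(predictive_view_loss(context, query, target, weights))
        weights = sgd_step(context, query, target, weights)
```

The point of the sketch is only that the supervision comes from held-out views themselves, so no 3D labels or spatial instruction data are needed; the actual encoder, decoder, and feature targets are defined in the paper and its repository.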

Related papers