Fugu-MT Paper Translation (Abstract): Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

Paper overview: Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

arxiv url: http://arxiv.org/abs/2511.10946v1
Date: Fri, 14 Nov 2025 04:16:09 GMT
Status: Translation complete
Last updated in system: 2025-11-17 22:42:18.429565
Title: Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
Title (reference translation): Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
Authors: Yifan Liu, Fangneng Zhan, Kaichen Zhou, Yilun Du, Paul Pu Liang, Hanspeter Pfister
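Abstract summary: Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving, for instance, an 8.3% gain on SAT Real compared with baseline methods.
Reference score (independently computed attention score): 100.13033631690114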
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for instance, an 8.3% gain on SAT Real compared with baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.
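Abstract (reference translation): Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We attribute this to a modality gap between 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that uses abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline consisting of four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for example, an 8.3% gain on SAT Real compared with baseline methods. These results show that equipping VLMs with a 3D abstraction substantially improves their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.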
