Fugu-MT Paper Translation (Abstract): SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

Paper overview: SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

arxiv url: http://arxiv.org/abs/2603.27437v1
Date: Sat, 28 Mar 2026 22:49:40 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:44.956334
Title: SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
Authors: Jiang Zhang, Shijie Zhou, Bangya Liu, Achuta Kadambi, Zhiwen Fan
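Abstract summary: Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning. This work proposes SpatialStack, a hierarchical fusion framework. Building on this framework, the authors develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks.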
Reference score (independently computed attention score): 22.547972947051765
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.
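The contrast the abstract draws between late-stage fusion and multi-level fusion can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not state the fusion operator, which layers receive geometry features, or how features are projected. The sketch assumes simple additive (residual) injection, and `add_scaled`, `toy_layer`, and `layered_fusion` are hypothetical names introduced only for illustration.

```python
from typing import Dict, List

Vector = List[float]
Tokens = List[Vector]  # one feature vector per token


def add_scaled(hidden: Tokens, geometry: Tokens, scale: float) -> Tokens:
    """Assumed fusion operator: hidden + scale * geometry, token by token."""
    return [
        [h + scale * g for h, g in zip(h_tok, g_tok)]
        for h_tok, g_tok in zip(hidden, geometry)
    ]


def toy_layer(hidden: Tokens) -> Tokens:
    """Stand-in for one language-backbone layer: blend each token with the token mean."""
    n = len(hidden)
    mean = [sum(col) / n for col in zip(*hidden)]
    return [[0.5 * x + 0.5 * m for x, m in zip(tok, mean)] for tok in hidden]


def layered_fusion(
    hidden: Tokens,
    geo_by_layer: Dict[int, Tokens],
    num_layers: int,
    scale: float = 1.0,
) -> Tokens:
    """Run the toy backbone, injecting geometry features at every layer listed in geo_by_layer.

    Late-stage fusion corresponds to a single entry (one set of deep geometry
    features fused once); multi-level fusion supplies one entry per chosen layer.
    """
    for layer in range(num_layers):
        if layer in geo_by_layer:
            hidden = add_scaled(hidden, geo_by_layer[layer], scale)
        hidden = toy_layer(hidden)
    return hidden


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    shallow = [[0.1, 0.0], [0.0, 0.1]]   # hypothetical shallow-layer geometry features
    deep = [[1.0, 1.0], [1.0, 1.0]]      # hypothetical deep-layer geometry features

    late = layered_fusion(tokens, {0: deep}, num_layers=2)
    multi = layered_fusion(tokens, {0: shallow, 1: deep}, num_layers=2)
    print("late-stage fusion:", late)
    print("multi-level fusion:", multi)
```

In this sketch the only difference between the two regimes is how many entries `geo_by_layer` holds, which makes the "discarded hierarchical signals" point concrete: with a single entry, the shallow geometry features never reach the backbone at all.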

Related papers