G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
- URL: http://arxiv.org/abs/2511.21688v2
- Date: Thu, 27 Nov 2025 07:58:27 GMT
- Title: G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
- Authors: Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, Jiangmiao Pang
- Abstract summary: Vision-Language Models (VLMs) still lack robustness in spatial intelligence. We present G$^2$VLM, a vision-language model that bridges two fundamental aspects of spatial intelligence.
- Score: 36.62798449863548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) still lack robustness in spatial intelligence, performing poorly on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry-grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and to enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data while still benefiting from 3D visual priors that are typically derived only from hard-to-collect annotations. Experimental results demonstrate that G$^2$VLM is proficient in both tasks, matching state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock future applications such as 3D scene editing.
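The abstract gives no architecture details here, so the following is a minimal, hypothetical PyTorch sketch of the general idea it describes: encode multi-view image features into geometry tokens, place them in the same context as text tokens, and decode both 3D attributes (e.g., a per-patch point map) and language from a shared backbone. All module names, dimensions, and heads below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): one way a VLM could consume learned
# 3D geometry features alongside text tokens, as the abstract describes.
# All module/parameter names here are illustrative assumptions.
import torch
import torch.nn as nn

class GeometryGroundedVLMSketch(nn.Module):
    def __init__(self, d_vision=768, d_model=1024, vocab_size=32000, n_layers=2):
        super().__init__()
        # Stand-in geometry encoder: maps per-patch multi-view features to
        # geometry tokens (a real system would use a dedicated 3D backbone).
        self.geometry_encoder = nn.Sequential(
            nn.Linear(d_vision, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Two heads: language modeling and per-patch 3D attributes (xyz per
        # patch), mirroring the "unified" reconstruction + reasoning idea.
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.point_head = nn.Linear(d_model, 3)

    def forward(self, patch_feats, text_ids):
        # patch_feats: (B, N_patches, d_vision) multi-view image features
        # text_ids:    (B, N_text) token ids
        geo_tokens = self.geometry_encoder(patch_feats)
        txt_tokens = self.text_embed(text_ids)
        x = torch.cat([geo_tokens, txt_tokens], dim=1)   # shared context
        h = self.backbone(x)
        n_geo = geo_tokens.shape[1]
        points = self.point_head(h[:, :n_geo])           # 3D attribute prediction
        logits = self.lm_head(h[:, n_geo:])              # spatial-reasoning text
        return points, logits

# Toy usage: two views' worth of patch features plus a short question.
model = GeometryGroundedVLMSketch()
feats = torch.randn(1, 2 * 196, 768)
ids = torch.randint(0, 32000, (1, 16))
points, logits = model(feats, ids)
print(points.shape, logits.shape)  # (1, 392, 3) (1, 16, 32000)
```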
Related papers
- Think3D: Thinking with Space for Spatial Reasoning [54.518667686880114]
We introduce Think3D, a framework that enables vision-language models (VLMs) to think with 3D space. Without additional training, Think3D significantly improves the spatial reasoning performance of advanced models. Our findings demonstrate that training-free, tool-augmented spatial exploration is a viable path toward more flexible and human-like 3D reasoning in multimodal agents.
arXiv Detail & Related papers (2026-01-19T13:13:54Z) - Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z) - ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models [0.0]
ZING-3D is a framework that generates a rich semantic representation of a 3D scene in a zero-shot manner. It also enables incremental updates and geometric grounding in 3D space, making it suitable for downstream robotics applications. Our experiments on scenes from the Replica and HM3D datasets show that ZING-3D is effective at capturing spatial and relational knowledge without the need for task-specific training.
arXiv Detail & Related papers (2025-10-24T00:52:33Z) - UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding [65.60549881706959]
We introduce UniUGG, the first unified understanding and generation framework for 3D modalities. Our framework employs an LLM to comprehend and decode sentences and 3D representations. We propose a spatial decoder leveraging a latent diffusion model to generate high-quality 3D representations.
arXiv Detail & Related papers (2025-08-16T07:27:31Z) - 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation [17.294440057314812]
Vision-Language Models (VLMs) have shown remarkable performance on diverse visual and linguistic tasks. We propose Geometric Distillation, a framework that injects human-inspired geometric cues into pretrained VLMs. Our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs.
arXiv Detail & Related papers (2025-06-11T15:56:59Z) - VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z) - Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z) - CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios (a minimal contrastive-alignment sketch follows this list).
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
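As referenced in the CLIP$^2$ entry above, contrastive language-image-point pretraining pulls paired text, image, and point-cloud embeddings into a shared space. The sketch below is a generic symmetric InfoNCE objective over aligned triplets with placeholder random embeddings; it illustrates the loss shape only and is not the paper's implementation.

```python
# Hedged sketch of the contrastive idea behind CLIP^2-style pretraining: align
# paired text / image / point-cloud embeddings with a symmetric InfoNCE loss.
# The embeddings and pairing logic are placeholders, not the paper's code.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss over a batch of aligned (a_i, b_i) pairs."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.shape[0])        # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy triplets: in practice these would come from text, image, and point encoders.
text_emb = torch.randn(8, 512)
image_emb = torch.randn(8, 512)
point_emb = torch.randn(8, 512)

# Pull the three modalities toward a shared embedding space pairwise.
loss = (info_nce(text_emb, point_emb) +
        info_nce(image_emb, point_emb) +
        info_nce(text_emb, image_emb)) / 3
print(loss.item())
```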