Think3D: Thinking with Space for Spatial Reasoning
- URL: http://arxiv.org/abs/2601.13029v1
- Date: Mon, 19 Jan 2026 13:13:54 GMT
- Title: Think3D: Thinking with Space for Spatial Reasoning
- Authors: Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, Lijun Wang, Huchuan Lu
- Abstract summary: We introduce Think3D, a framework that enables vision large models (VLMs) to think with 3D space. Without additional training, Think3D significantly improves the spatial reasoning performance of advanced models. Our findings demonstrate that training-free, tool-augmented spatial exploration is a viable path toward more flexible and human-like 3D reasoning in multimodal agents.
- Score: 54.518667686880114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding and reasoning about the physical world requires spatial intelligence: the ability to interpret geometry, perspective, and spatial relations beyond 2D perception. While recent vision large models (VLMs) excel at visual understanding, they remain fundamentally 2D perceivers and struggle with genuine 3D reasoning. We introduce Think3D, a framework that enables VLM agents to think with 3D space. By leveraging 3D reconstruction models that recover point clouds and camera poses from images or videos, Think3D allows the agent to actively manipulate space through camera-based operations and ego/global-view switching, transforming spatial reasoning into an interactive 3D chain-of-thought process. Without additional training, Think3D significantly improves the spatial reasoning performance of advanced models such as GPT-4.1 and Gemini 2.5 Pro, yielding average gains of +7.8% on BLINK Multi-view and MindCube, and +4.7% on VSI-Bench. We further show that smaller models, which struggle with spatial exploration, benefit significantly from a reinforcement learning policy that enables the model to select informative viewpoints and operations. With RL, the benefit from tool usage increases from +0.7% to +6.8%. Our findings demonstrate that training-free, tool-augmented spatial exploration is a viable path toward more flexible and human-like 3D reasoning in multimodal agents, establishing a new dimension of multimodal intelligence. Code and weights are released at https://github.com/zhangzaibin/spagent.
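To make the camera-based tool loop concrete, the following is a minimal Python sketch of an interactive 3D chain-of-thought in the spirit of the abstract. It is a hedged illustration, not the released implementation: `reconstruct_scene`, `render_view`, and the `vlm.step`/`vlm.finalize` interface are hypothetical placeholders (the actual tooling lives in the spagent repository linked above).

```python
import numpy as np

def reconstruct_scene(frames):
    """Placeholder for a 3D reconstruction model that recovers a colored
    point cloud and per-frame camera poses from images or video."""
    raise NotImplementedError

def render_view(points, colors, pose, mode):
    """Placeholder renderer: project the point cloud into an image as seen
    from `pose`; `mode` selects the ego view or a global bird's-eye view."""
    raise NotImplementedError

def think3d_loop(frames, question, vlm, max_steps=8):
    """Interactive 3D chain-of-thought: alternate between rendering a view
    of the reconstructed scene and letting the VLM pick the next camera
    operation, until the model commits to an answer."""
    points, colors, poses = reconstruct_scene(frames)
    pose, mode = poses[0].copy(), "ego"      # start from the first camera
    history = []
    for _ in range(max_steps):
        view = render_view(points, colors, pose, mode)
        # The VLM sees the question, the current rendering, and its past
        # actions, then emits either an answer or a camera operation.
        action = vlm.step(question, view, history)
        history.append(action)
        if action["type"] == "answer":
            return action["text"]
        if action["type"] == "switch_view":           # ego <-> global
            mode = "global" if mode == "ego" else "ego"
        elif action["type"] == "move_camera":         # 4x4 rigid transform
            pose = pose @ np.asarray(action["delta"])
    return vlm.finalize(question, history)            # forced final answer
```

Note that the loop never updates model weights, matching the training-free setting; the RL variant described in the abstract would instead train the policy that chooses viewpoints and operations inside `vlm.step`.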
Related papers
- G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning [36.62798449863548]
Vision-Language Models (VLMs) still lack robustness in spatial intelligence. We present G$^2$VLM, a vision-language model that bridges two fundamental aspects of spatial intelligence.
arXiv Detail & Related papers (2025-11-26T18:59:39Z)
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z)
- FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction [13.098585993121722]
We present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction. Experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency.
arXiv Detail & Related papers (2025-09-25T22:24:23Z)
- Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting [64.64738535860351]
We present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations. Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence.
arXiv Detail & Related papers (2025-07-24T14:53:26Z)
- MindJourney: Test-Time Scaling with World Models for Spatial Reasoning [97.61985090279961]
We propose MindJourney, a test-time scaling framework for vision-language models. We show that MindJourney achieves an average performance boost of over 7.7% on the representative spatial reasoning benchmark SAT. Our method also improves upon the test-time inference of VLMs trained through reinforcement learning.
arXiv Detail & Related papers (2025-07-16T17:59:36Z)
- 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark [25.311698492216127]
3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within 3D space. Large multi-modal models (LMMs) have achieved remarkable progress in a wide range of image and video understanding tasks. We present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual question-answer pairs.
arXiv Detail & Related papers (2024-12-10T18:55:23Z)
- LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
Multimodal large language models (MLLMs) excel in general capabilities but underperform on 3D tasks. In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and the inability to handle camera focal variations. We employ parameter-efficient fine-tuning of a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
arXiv Detail & Related papers (2024-08-14T10:00:16Z)
- On the Efficacy of 3D Point Cloud Reinforcement Learning [20.4424883945357]
We focus on 3D point clouds, one of the most common forms of 3D representations.
We systematically investigate design choices for 3D point cloud RL, leading to the development of a robust algorithm for various robotic manipulation and control tasks.
We find that 3D point cloud RL can significantly outperform its 2D counterpart when agent-object / object-object relationship encoding is a key factor.
arXiv Detail & Related papers (2023-06-11T22:52:08Z)
- SimIPU: Simple 2D Image and 3D Point Cloud Unsupervised Pre-Training for Spatial-Aware Visual Representations [85.38562724999898]
We propose a 2D Image and 3D Point cloud Unsupervised pre-training strategy, called SimIPU.
Specifically, we develop a multi-modal contrastive learning framework that consists of an intra-modal spatial perception module and an inter-modal feature interaction module.
To the best of our knowledge, this is the first study to explore contrastive learning pre-training strategies for outdoor multi-modal datasets.
arXiv Detail & Related papers (2021-12-09T03:27:00Z)
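For readers unfamiliar with the inter-modal component mentioned in the SimIPU summary above, here is a generic symmetric InfoNCE-style contrastive loss over paired image and point-cloud embeddings. This is an illustrative sketch of the contrastive idea only, not SimIPU's actual modules; the function name and tensor shapes are my own assumptions.

```python
import torch
import torch.nn.functional as F

def paired_contrastive_loss(img_feats, pc_feats, temperature=0.07):
    """Symmetric InfoNCE over matched 2D-image / 3D-point embeddings.
    img_feats, pc_feats: (N, D) tensors; row i of each comes from the
    same scene, so matched pairs lie on the diagonal of the logits."""
    img = F.normalize(img_feats, dim=-1)
    pc = F.normalize(pc_feats, dim=-1)
    logits = img @ pc.t() / temperature      # (N, N) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Pull matched pairs together and push mismatched pairs apart,
    # symmetrically in the image->point and point->image directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```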