pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning
- URL: http://arxiv.org/abs/2603.00905v1
- Date: Sun, 01 Mar 2026 03:55:49 GMT
- Title: pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning
- Authors: Zhanpeng Luo, Ce Zhang, Silong Yong, Cunxi Dai, Qianwei Wang, Haoxi Ran, Guanya Shi, Katia Sycara, Yaqi Xie
- Abstract summary: pySpatial is a visual programming framework that equips MLLMs with the ability to interface with spatial tools. pySpatial converts raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations.
- Score: 18.697914587954163
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D reconstruction, camera-pose recovery, novel-view rendering, etc. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks demonstrate that our framework pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, we conduct real-world indoor navigation experiments where the robot can successfully traverse complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach.
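To make the idea of reasoning over an explicit 3D representation concrete, here is a minimal, self-contained sketch of the kind of computation such a visual program can offload to spatial tools. It is not pySpatial's actual API: the function, object positions, and camera pose below are hypothetical stand-ins for what the framework's reconstruction and grounding tools would supply.
```python
# Minimal sketch (not the pySpatial API): once objects and cameras live in a shared
# 3D frame, metric and view-dependent questions reduce to explicit geometry.
import numpy as np

def relative_layout(obj_a, obj_b, cam_pos, cam_forward, up=(0.0, 0.0, 1.0)):
    """Return the metric distance between two objects and whether obj_a appears
    to the left of obj_b from a camera at cam_pos looking along cam_forward."""
    obj_a, obj_b = np.asarray(obj_a, float), np.asarray(obj_b, float)
    cam_pos, cam_forward = np.asarray(cam_pos, float), np.asarray(cam_forward, float)

    distance = np.linalg.norm(obj_a - obj_b)

    # Camera "right" axis from forward and up (assumes a +Z-up world frame).
    right = np.cross(cam_forward, np.asarray(up, float))
    right /= np.linalg.norm(right)

    # A smaller projection onto the right axis means further to the camera's left.
    a_is_left = np.dot(obj_a - cam_pos, right) < np.dot(obj_b - cam_pos, right)
    return distance, a_is_left

# Hypothetical outputs of reconstruction/grounding: a chair at (1, 2, 0) m, a table
# at (3, 1, 0) m, and a camera at the origin facing +Y.
dist, left = relative_layout([1, 2, 0], [3, 1, 0], [0, 0, 0], [0, 1, 0])
print(f"distance = {dist:.2f} m, chair left of table: {left}")
```
As described in the abstract, the point of the framework is that the MLLM generates this kind of code and delegates the geometry to tools, rather than estimating distances and directions directly from pixels.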
Related papers
- S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance [20.55536735670125]
3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. We propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning.
arXiv Detail & Related papers (2025-12-01T03:08:34Z)
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z)
- REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting [16.896443736904356]
Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions. We introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation. Our framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer.
arXiv Detail & Related papers (2025-10-18T08:53:08Z)
- RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics [67.11221574129937]
Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. We propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding. RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning.
arXiv Detail & Related papers (2025-06-04T17:59:27Z)
- VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z)
- Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z)
- SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models [9.279591094901152]
SORT3D is an approach that utilizes rich object attributes from 2D data and merges a spatial reasoning toolbox with the ability of large language models (LLMs) to perform sequential reasoning. We show that SORT3D achieves state-of-the-art zero-shot performance on complex view-dependent grounding tasks on two benchmarks. We also implement the pipeline to run in real time on two autonomous vehicles and demonstrate that our approach can be used for object-goal navigation in previously unseen real-world environments.
arXiv Detail & Related papers (2025-04-25T20:24:11Z)
- LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
Multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks. In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and inability to handle camera focal variations. We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
arXiv Detail & Related papers (2024-08-14T10:00:16Z)
- Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple training-free approach brings substantial gains to GPT4-V/O consistently on four benchmarks.
arXiv Detail & Related papers (2024-08-01T17:57:12Z)
- Language-Image Models with 3D Understanding [59.499585515469974]
We develop a large-scale pre-training dataset for 2D and 3D called LV3D.
Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D.
We show that pure data scaling yields strong 3D perception capability without 3D-specific architectural designs or training objectives.
arXiv Detail & Related papers (2024-05-06T17:57:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.