Related papers: Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis

Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis

URL: http://arxiv.org/abs/2512.16237v1
Date: Thu, 18 Dec 2025 06:30:08 GMT
Title: Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis
Authors: Zhi Helu, Huang Jingjing, Xu Wang, Xu Yangbin, Zhang Wanyue, Jiang Baoyang, Deng Shirui, Zhu Liang, Li Fangfang, Zhao Tiejun, Lin Yankai, Yao Yuan,
Abstract summary: Vision-Language Models (VLMs) are scalable but structurally rigid, while manual annotation is linguistically diverse but unscalable.<n>We introduce SP-RITE, a novel framework that overcomes this dilemma leveraging simulators and large models.<n>We have curated a dataset encompassing 3 simulators, 11k+ scenes, and 300k+ image/video instruction-tuning pairs.<n>We demonstrate that a VLM trained on our data achieves significant performance gains on multiple spatial benchmarks.
Score: 8.60591720958037
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Embodied intelligence, a grand challenge in artificial intelligence, is fundamentally constrained by the limited spatial understanding and reasoning capabilities of current models. Prevailing efforts to address this through enhancing Vision-Language Models (VLMs) are trapped in a dilemma: template-based datasets are scalable but structurally rigid, while manual annotation is linguistically diverse but unscalable and, critically, computationally imprecise. We introduce SPRITE, a novel framework that overcomes this dilemma by leveraging simulators and large models to programmatically synthesize scalable, diverse, and high-quality spatial reasoning data. The core innovation of SPRITE is to reframe ground-truth generation as a code-generation task. We utilize LLMs to compile complex spatial questions into executable programs, which are then verified against high-precision scene meta-information extracted from simulators. This ensures our ground truth is both computationally precise and verifiable, while the generative power of LLMs provides vast linguistic diversity. Leveraging this pipeline, we have curated a dataset encompassing 3 simulators, 11k+ scenes, and 300k+ image/video instruction-tuning pairs. We demonstrate that a VLM trained on our data achieves significant performance gains on multiple spatial benchmarks and outperforms other open-source datasets of equivalent size. Furthermore, a scalability analysis confirms our hypothesis that overcoming the low-diversity nature of traditional template methods is essential for building robust, generalizable spatial intelligence. We will make the SPRITE framework code and the full 300k+ dataset publicly available to facilitate future research in spatial intelligence.

Related papers

TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch.<n>We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone.<n>To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment.<n>With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions [18.455501447828343]
Spatial Intelligence (SI) has predominantly relied on Vision-Language Models (VLMs)<n>We introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input.<n>We find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential.
arXiv Detail & Related papers (2026-01-07T05:13:52Z)
From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs [65.04549036809557]
We introduce a benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors.<n>This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions.<n> Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings.
arXiv Detail & Related papers (2025-12-22T18:58:12Z)
ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints [42.713620384054146]
This paper introduces ORIGAMISPACE, a new dataset and benchmark designed to evaluate the multi-step spatial reasoning ability.<n>We propose four evaluation tasks: Pattern Prediction, Multi-step Spatial Reasoning, Spatial Relationship Prediction, and End-to-End CP Code Generation.
arXiv Detail & Related papers (2025-11-23T13:42:22Z)
SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding [64.86119288520419]
multimodal language models struggle with spatial reasoning across time and space.<n>We present SIMS-V -- a systematic data-generation framework that leverages the privileged information of 3D simulators.<n>Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
arXiv Detail & Related papers (2025-11-06T18:53:31Z)
SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters [0.0]
We present SmolRGPT, a compact vision-language architecture that explicitly incorporates region-level spatial reasoning.<n>SmolRGPT employs a three-stage curriculum that progressively align visual and language features, enables spatial relationship understanding, and adapts to task-specific datasets.<n>We demonstrate that with only 600M parameters, SmolRGPT achieves competitive results on challenging warehouse spatial reasoning benchmarks, matching or exceeding the performance of much larger alternatives.
arXiv Detail & Related papers (2025-09-18T23:55:51Z)
Can Large Language Models Integrate Spatial Data? Empirical Insights into Reasoning Strengths and Computational Weaknesses [11.330846631937671]
We explore the application of large language models (LLMs) to empower domain experts in integrating large, heterogeneous, and noisy urban spatial datasets.<n>We show that while LLMs exhibit spatial reasoning capabilities, they struggle to connect the macro-scale environment with the relevant computational geometry tasks.<n>We then adapt a review-and-refine method, which proves remarkably effective in correcting erroneous initial responses while preserving accurate responses.
arXiv Detail & Related papers (2025-08-07T03:44:20Z)
Improving Large Vision-Language Models' Understanding for Field Data [62.917026891829025]
We introduce FieldLVLM, a framework designed to improve large vision-language models' understanding of field data.<n>FieldLVLM consists of two main components: a field-aware language generation strategy and a data-compressed multimodal model tuning.<n> Experimental results on newly proposed benchmark datasets demonstrate that FieldLVLM significantly outperforms existing methods in tasks involving scientific field data.
arXiv Detail & Related papers (2025-07-24T11:28:53Z)
Spatial Knowledge Graph-Guided Multimodal Synthesis [78.11669780958657]
We introduce a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation.<n>In experiments, data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly.<n>We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.
arXiv Detail & Related papers (2025-05-28T17:50:21Z)
Spatial-Temporal-Spectral Unified Modeling for Remote Sensing Dense Prediction [20.1863553357121]
Current deep learning architectures for remote sensing are fundamentally rigid.<n>We introduce the Spatial-Temporal-Spectral Unified Network (STSUN) for unified modeling.<n> STSUN can adapt to input and output data with arbitrary spatial sizes, temporal lengths, and spectral bands.<n>It unifies various dense prediction tasks and diverse semantic class predictions.
arXiv Detail & Related papers (2025-05-18T07:39:17Z)
MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning [63.80739044622555]
We introduce MuSR, a dataset for evaluating language models on soft reasoning tasks specified in a natural language narrative. This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm. Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning.
arXiv Detail & Related papers (2023-10-24T17:59:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.