SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
- URL: http://arxiv.org/abs/2510.09606v1
- Date: Fri, 10 Oct 2025 17:59:46 GMT
- Title: SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
- Authors: Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue
- Abstract summary: This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation, and the absence of effective all-scale scene modeling. We introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm.
- Score: 43.506658643163405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm; to the best of our knowledge, this is the first attempt to broaden the all-scale spatial intelligence of MLLMs. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We therefore build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to potential knowledge conflicts. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released at https://peiwensun2000.github.io/mm2km .
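The abstract's central modeling idea is to use the scene scale as an anchor that routes inputs to scale-aware experts. The abstract does not spell out the mechanism, so the following is only a minimal illustrative sketch, assuming a soft mixture of per-scale expert heads gated by an embedding of a discrete scale label; the module names, dimensions, and five scale buckets are assumptions for illustration, not the released SpaceVista-7B implementation.

```python
# Hypothetical sketch of scale-anchored expert routing (not the authors' code).
import torch
import torch.nn as nn

NUM_SCALES = 5    # e.g. mm / cm / m / tens-of-m / km buckets (assumed)
HIDDEN_DIM = 256  # hypothetical feature width
NUM_EXPERTS = 5   # one lightweight expert head per scale bucket (assumed)


class ScaleAwareExperts(nn.Module):
    def __init__(self, hidden_dim=HIDDEN_DIM, num_experts=NUM_EXPERTS, num_scales=NUM_SCALES):
        super().__init__()
        # One small expert head per scale bucket.
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts)]
        )
        # The scale label acts as the anchor: its embedding drives the gating weights.
        self.scale_embed = nn.Embedding(num_scales, hidden_dim)
        self.gate = nn.Linear(hidden_dim, num_experts)

    def forward(self, features, scale_id):
        # features: (batch, hidden_dim) fused visual features
        # scale_id: (batch,) integer scale bucket per sample
        anchor = self.scale_embed(scale_id)                  # (batch, hidden_dim)
        weights = torch.softmax(self.gate(anchor), dim=-1)   # (batch, num_experts)
        expert_out = torch.stack([e(features) for e in self.experts], dim=1)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)


# Usage: route a batch of features by their (known or predicted) scale bucket.
model = ScaleAwareExperts()
feats = torch.randn(4, HIDDEN_DIM)
scales = torch.tensor([0, 2, 2, 4])   # e.g. mm, m, m, km buckets
out = model(feats, scales)            # (4, HIDDEN_DIM)
```

In such a design, replacing the softmax with a one-hot over the labeled scale gives hard routing, while soft gating lets neighboring scales share capacity; which choice SpaceVista-7B actually makes is not stated in the abstract.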
Related papers
- MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence [61.065486539729875]
MMSI-Video-Bench is a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework (Perception, Planning, Prediction, and Cross-Video Reasoning) through 1,106 questions grounded in 1,278 clips. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human-AI gap.
arXiv Detail & Related papers (2025-12-11T17:57:24Z)
- Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks [108.15756345836901]
We provide a comprehensive review of multimodal spatial reasoning tasks with large models. We review advances in embodied AI, including vision-language navigation and action models. We consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors.
arXiv Detail & Related papers (2025-10-29T17:55:43Z)
- InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy [138.89177083578213]
We introduce InternVLA-M1, a unified framework for spatial grounding and robot control. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning examples, and (ii) spatially guided action post-training. InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka.
arXiv Detail & Related papers (2025-10-15T17:30:05Z)
- SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models [75.64836077468722]
Vision language models (VLMs) excel in 2D semantic visual understanding, but their ability to quantitatively reason about 3D spatial relationships remains under-explored. We propose SD-VLM, a novel framework that significantly enhances fundamental spatial perception abilities of VLMs. We have trained SD-VLM, a strong generalist VLM which shows superior quantitative spatial measuring and understanding capability.
arXiv Detail & Related papers (2025-09-22T12:08:12Z)
- Spatial-ORMLLM: Improve Spatial Relation Understanding in the Operating Room with Multimodal Large Language Model [1.8302608976873713]
Spatial-ORMLLM is a vision-language model for 3D spatial reasoning in operating rooms. It incorporates 2D modality inputs with rich 3D spatial knowledge extracted by the estimation algorithm. It delivers robust 3D scene reasoning without any additional expert annotations or sensor inputs.
arXiv Detail & Related papers (2025-08-11T17:17:20Z)
- LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks [22.011855291417856]
It remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. In this work, we introduce a spatial evaluation pipeline and construct a corresponding benchmark. We conduct experiments on multiple state-of-the-art VLMs and observe that there is significant room for improvement.
arXiv Detail & Related papers (2025-07-27T08:31:24Z)
- Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos [66.62109400603394]
We introduce Being-H0, a dexterous Vision-Language-Action model trained on large-scale human videos. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. We empirically demonstrate Being-H0's strength in hand motion generation and instruction following, and it scales well with model and data sizes.
arXiv Detail & Related papers (2025-07-21T13:19:09Z)
- ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation.
arXiv Detail & Related papers (2025-05-27T17:59:26Z)
- MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs [19.70116190496693]
Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes.
arXiv Detail & Related papers (2025-03-17T12:34:22Z)
- SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model [72.0795843450604]
Current approaches face challenges in maintaining consistent accuracy across diverse scenes. These methods rely on extensive datasets comprising millions, if not tens of millions, of samples for training. This paper presents SM4Depth, a model that works seamlessly for both indoor and outdoor scenes.
arXiv Detail & Related papers (2024-03-13T14:08:25Z)
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities [59.39858959066982]
Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics.
We develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images.
By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA.
arXiv Detail & Related papers (2024-01-22T18:01:01Z)
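Both the specialist-driven pipeline behind SpaceVista-1M and SpatialVLM's automatic 3D spatial VQA framework rest on the same basic recipe: compute metric relations from per-object 3D annotations (e.g. produced by detectors and depth estimators) and render them into question-answer text. The snippet below is a hedged sketch of one such distance template; the object records, field names, and question wording are invented for illustration and are not taken from either paper's pipeline.

```python
# Illustrative template-based spatial QA generation from 3D annotations (assumed format).
import math


def distance_qa(obj_a, obj_b):
    """Turn two annotated objects (name + 3D center in meters) into a QA pair."""
    dx = obj_a["center"][0] - obj_b["center"][0]
    dy = obj_a["center"][1] - obj_b["center"][1]
    dz = obj_a["center"][2] - obj_b["center"][2]
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    question = f"How far is the {obj_a['name']} from the {obj_b['name']}?"
    answer = f"About {dist:.1f} meters."
    return {"question": question, "answer": answer}


# Example annotations as a specialist detector / depth model might supply them.
chair = {"name": "chair", "center": (1.0, 0.0, 2.0)}
table = {"name": "table", "center": (2.5, 0.0, 2.0)}
print(distance_qa(chair, table))
# {'question': 'How far is the chair from the table?', 'answer': 'About 1.5 meters.'}
```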