SpatialLM: Training Large Language Models for Structured Indoor Modeling
- URL: http://arxiv.org/abs/2506.07491v1
- Date: Mon, 09 Jun 2025 07:10:58 GMT
- Title: SpatialLM: Training Large Language Models for Structured Indoor Modeling
- Authors: Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, Zihan Zhou
- Abstract summary: SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. We collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes with ground-truth 3D annotations. Our model gives state-of-the-art performance in layout estimation and competitive results in 3D object detection.
- Score: 34.0957676434764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object boxes with their semantic categories. Unlike previous methods which exploit task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs. To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study on various modeling and training decisions. On public benchmarks, our model gives state-of-the-art performance in layout estimation and competitive results in 3D object detection. With that, we show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.
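As an illustration of the structured outputs described in the abstract, the following is a minimal, hypothetical Python sketch of a scene representation with walls, doors, and oriented object boxes, serialized as line-per-entity text of the kind a language model could emit. The class names, fields, and serialization format are assumptions for illustration, not SpatialLM's actual output schema.

```python
from dataclasses import dataclass

@dataclass
class Wall:
    # Wall centerline endpoints on the floor plane (meters), plus height and thickness.
    ax: float; ay: float; az: float
    bx: float; by: float; bz: float
    height: float
    thickness: float

@dataclass
class Door:
    wall_id: str                      # wall the door is attached to
    cx: float; cy: float; cz: float   # door center
    width: float
    height: float

@dataclass
class ObjectBox:
    category: str                     # semantic class, e.g. "sofa"
    cx: float; cy: float; cz: float   # box center
    yaw: float                        # rotation about the vertical axis (radians)
    sx: float; sy: float; sz: float   # box extents

def serialize(scene: dict) -> str:
    """Flatten a scene into one line of text per entity."""
    lines = []
    for name, ent in scene.items():
        fields = ",".join(f"{v:.2f}" if isinstance(v, float) else str(v)
                          for v in vars(ent).values())
        lines.append(f"{name}={type(ent).__name__}({fields})")
    return "\n".join(lines)

if __name__ == "__main__":
    scene = {
        "wall_0": Wall(0.0, 0.0, 0.0, 4.2, 0.0, 0.0, 2.8, 0.12),
        "door_0": Door("wall_0", 1.1, 0.0, 1.0, 0.9, 2.0),
        "bbox_0": ObjectBox("sofa", 2.0, 1.5, 0.4, 1.57, 1.8, 0.9, 0.8),
    }
    print(serialize(scene))
```

A real system would generate such text autoregressively from point-cloud tokens; the sketch only fixes a target representation that training could supervise.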
Related papers
- Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis [44.66179436245703]
Follow-Your-Instruction is a framework for automatically synthesizing high-quality 2D, 3D, and 4D data. It constructs 3D layouts and leverages Vision-Language Models (VLMs) for semantic refinement. We evaluate the quality of the generated data through comprehensive experiments on 2D, 3D, and 4D generative tasks.
arXiv Detail & Related papers (2025-08-07T17:12:54Z)
- Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence [13.168559963356952]
We present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Our key insight is to unleash the strong structure prior of the feed-forward visual geometry foundation model. A connector then integrates both features into unified visual tokens for enhanced spatial understanding (a minimal sketch of such a connector appears after this list).
arXiv Detail & Related papers (2025-05-29T17:59:04Z)
- VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z)
- MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [87.30919771444117]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation. We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
- Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding [19.382210260928776]
Video-3D LLM treats 3D scenes as dynamic videos and incorporates 3D position encoding into these representations. Our model achieves state-of-the-art performance on several 3D scene understanding benchmarks.
arXiv Detail & Related papers (2024-11-30T14:28:53Z)
- ROOT: VLM based System for Indoor Scene Understanding and Beyond [83.71252153660078]
ROOT is a VLM-based system designed to enhance the analysis of indoor scenes.
ROOT facilitates indoor scene understanding and proves effective in diverse downstream applications, such as 3D scene generation and embodied AI.
arXiv Detail & Related papers (2024-11-24T04:51:24Z)
- LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
Multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks. In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and inability to handle camera focal variations. We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
arXiv Detail & Related papers (2024-08-14T10:00:16Z)
- Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple training-free approach brings substantial gains to GPT4-V/O consistently on four benchmarks.
arXiv Detail & Related papers (2024-08-01T17:57:12Z)
- LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model [58.24851949945434]
LLplace is a novel 3D indoor scene layout designer based on a lightweight, fine-tuned open-source LLM, Llama3.
LLplace circumvents the need for spatial relationship priors and in-context exemplars, enabling efficient and credible room layout generation.
Our approach demonstrates that LLplace can effectively generate and edit 3D indoor layouts interactively and outperform existing methods in delivering high-quality 3D design solutions.
arXiv Detail & Related papers (2024-06-06T08:53:01Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
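The Spatial-MLLM entry above describes a connector that integrates 2D visual features and geometry features into unified visual tokens for the LLM. Below is a minimal, hypothetical PyTorch sketch of such a two-branch connector; the module name, feature dimensions, and project-then-sum fusion are illustrative assumptions, not the architecture of that paper.

```python
import torch
import torch.nn as nn

class DualFeatureConnector(nn.Module):
    """Toy connector: align per-patch 2D visual features and 3D geometry
    features to the LLM width, merge them, and return unified visual tokens."""

    def __init__(self, dim_2d: int = 768, dim_3d: int = 512, dim_llm: int = 1024):
        super().__init__()
        self.proj_2d = nn.Linear(dim_2d, dim_llm)   # project 2D-encoder features
        self.proj_3d = nn.Linear(dim_3d, dim_llm)   # project geometry-encoder features
        self.fuse = nn.Sequential(nn.GELU(), nn.Linear(dim_llm, dim_llm))

    def forward(self, feats_2d: torch.Tensor, feats_3d: torch.Tensor) -> torch.Tensor:
        # feats_2d: (batch, num_patches, dim_2d); feats_3d: (batch, num_patches, dim_3d)
        tokens = self.proj_2d(feats_2d) + self.proj_3d(feats_3d)  # per-patch sum fusion
        return self.fuse(tokens)                                  # (batch, num_patches, dim_llm)

if __name__ == "__main__":
    connector = DualFeatureConnector()
    vis = torch.randn(2, 196, 768)    # e.g. features from a 2D visual encoder
    geo = torch.randn(2, 196, 512)    # e.g. features from a geometry encoder
    print(connector(vis, geo).shape)  # torch.Size([2, 196, 1024])
```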