SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models
- URL: http://arxiv.org/abs/2410.03878v1
- Date: Fri, 4 Oct 2024 19:22:20 GMT
- Title: SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models
- Authors: Yue Zhang, Zhiyang Xu, Ying Shen, Parisa Kordjamshidi, Lifu Huang
- Abstract summary: We introduce a scalable situated 3D dataset, named Spartun3D, that incorporates various situated spatial reasoning tasks.
We also propose Spartun3D-LLM, built on an existing 3D-based LLM but integrated with a novel situated spatial alignment module.
- Score: 45.28780381341979
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Integrating the 3D world into large language models (3D-based LLMs) has been a promising research direction for 3D scene understanding. However, current 3D-based LLMs fall short in situated understanding due to two key limitations: 1) existing 3D datasets are constructed from a global perspective of the 3D scenes and lack situated context; and 2) the architectures of existing 3D-based LLMs lack explicit alignment between the spatial representations of 3D scenes and natural language, limiting their performance in tasks requiring precise spatial reasoning. We address these issues by introducing a scalable situated 3D dataset, named Spartun3D, that incorporates various situated spatial reasoning tasks. Furthermore, we propose Spartun3D-LLM, built on an existing 3D-based LLM but integrated with a novel situated spatial alignment module, aiming to enhance the alignment between 3D visual representations and their corresponding textual descriptions. Experimental results demonstrate that both our proposed dataset and alignment module significantly enhance the situated spatial understanding of 3D-based LLMs.
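The abstract does not spell out how the situated spatial alignment module is implemented; the snippet below is only a minimal PyTorch sketch of one plausible form such an alignment could take: a learned projection of per-object 3D features into the text embedding space, trained with a symmetric contrastive loss against embeddings of the matching descriptions. All class, argument, and dimension names are hypothetical and not taken from the paper.

```python
# Hypothetical sketch (not the paper's code): align per-object 3D features with
# text embeddings of their descriptions via a projection + InfoNCE-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SituatedAlignmentSketch(nn.Module):
    def __init__(self, obj_dim: int = 768, text_dim: int = 512, temperature: float = 0.07):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, text_dim)  # map 3D object features into text space
        self.temperature = temperature

    def forward(self, obj_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        """obj_feats: (N, obj_dim) 3D object features; text_feats: (N, text_dim)
        embeddings of the paired textual descriptions (row i pairs with row i)."""
        obj = F.normalize(self.obj_proj(obj_feats), dim=-1)
        txt = F.normalize(text_feats, dim=-1)
        logits = obj @ txt.t() / self.temperature          # (N, N) cosine-similarity logits
        targets = torch.arange(obj.size(0), device=obj.device)
        # Symmetric cross-entropy: each object should match its own description and vice versa.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Usage with random tensors standing in for 3D and text encoder outputs:
if __name__ == "__main__":
    align = SituatedAlignmentSketch()
    loss = align(torch.randn(8, 768), torch.randn(8, 512))
    print(loss.item())
```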
Related papers
- LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image [72.14973729674995]
Current 3D perception methods, particularly small models, struggle with logical reasoning, question answering, and open-scenario categories.
We propose solutions: Spatial-Enhanced Local Feature Mining for better spatial feature extraction, 3D Query Token-Derived Info Decoding for precise geometric regression, and Geometry Projection-Based 3D Reasoning for handling camera focal length variations.
arXiv Detail & Related papers (2024-08-14T10:00:16Z) - When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [113.18524940863841]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data.
Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs).
It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z) - Unified Scene Representation and Reconstruction for 3D Large Language Models [40.693839066536505]
Existing approaches extract point clouds either from ground truth (GT) geometry or 3D scenes reconstructed by auxiliary models.
We introduce Uni3DR2, which extracts 3D geometric and semantic-aware representation features via frozen 2D foundation models.
Our learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs.
arXiv Detail & Related papers (2024-04-19T17:58:04Z) - 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding [12.823274886850697]
We introduce a novel and efficient prompt tuning paradigm, 3DMIT.
This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with the 3D modality information.
We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain.
arXiv Detail & Related papers (2024-01-06T12:20:18Z) - LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR
Understanding [36.66305190056456]
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have shown promise in instruction following and 2D image understanding.
In this paper, we introduce LiDAR-LLM, which takes raw LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs.
The central insight of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem.
arXiv Detail & Related papers (2023-12-21T17:52:12Z) - Chat-3D: Data-efficiently Tuning Large Language Model for Universal
Dialogue of 3D Scenes [56.727745047799246]
3D scene understanding has gained significant attention due to its wide range of applications.
This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs.
arXiv Detail & Related papers (2023-08-17T03:52:15Z) - 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z) - Cylinder3D: An Effective 3D Framework for Driving-scene LiDAR Semantic
Segmentation [87.54570024320354]
State-of-the-art methods for large-scale driving-scene LiDAR semantic segmentation often project and process the point clouds in the 2D space.
A straightforward solution to tackle the issue of 3D-to-2D projection is to keep the 3D representation and process the points in the 3D space.
We develop a framework based on 3D cylinder partition and 3D cylinder convolution, termed Cylinder3D, which exploits the 3D topology relations and structures of driving-scene point clouds (a minimal sketch of the cylindrical partition follows this entry).
arXiv Detail & Related papers (2020-08-04T13:56:19Z)
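Cylinder3D's abstract mentions a 3D cylinder partition of driving-scene point clouds. As a rough illustration of that idea only (not the authors' code), the NumPy sketch below converts Cartesian LiDAR points to cylindrical coordinates and assigns them to cylindrical voxels; the grid resolution and coordinate ranges are assumed values.

```python
# Illustrative sketch of a cylindrical partition: convert LiDAR points from
# Cartesian (x, y, z) to cylindrical (rho, phi, z) and bin them into voxels.
# Grid size and ranges are assumptions, not values from the paper.
import numpy as np


def cylindrical_partition(points: np.ndarray,
                          grid_size=(480, 360, 32),
                          rho_range=(0.0, 50.0),
                          z_range=(-4.0, 2.0)) -> np.ndarray:
    """points: (N, 3) array of x, y, z. Returns (N, 3) integer voxel indices
    along the (rho, phi, z) axes of a cylindrical grid."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)   # radial distance from the sensor
    phi = np.arctan2(y, x)           # azimuth angle in (-pi, pi]

    cyl = np.stack([rho, phi, z], axis=1)
    lower = np.array([rho_range[0], -np.pi, z_range[0]])
    upper = np.array([rho_range[1], np.pi, z_range[1]])
    cell = (upper - lower) / np.array(grid_size)

    # Clip to the grid range, then compute per-axis voxel indices.
    idx = np.floor((np.clip(cyl, lower, upper - 1e-6) - lower) / cell).astype(np.int64)
    return idx  # per-point voxel indices; voxel features would then feed 3D cylinder convolutions


# Usage: partition a random point cloud and count occupied cylindrical voxels.
if __name__ == "__main__":
    pts = np.random.uniform(-40, 40, size=(10000, 3))
    voxels = cylindrical_partition(pts)
    print(np.unique(voxels, axis=0).shape[0], "occupied cylindrical voxels")
```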