SceneGPT: A Language Model for 3D Scene Understanding
- URL: http://arxiv.org/abs/2408.06926v1
- Date: Tue, 13 Aug 2024 14:26:30 GMT
- Title: SceneGPT: A Language Model for 3D Scene Understanding
- Authors: Shivam Chandhok
- Abstract summary: We present SceneGPT, an LLM-based scene understanding system that can perform 3D spatial reasoning without training or explicit 3D supervision.
The key components of our framework are: 1) a 3D scene graph that serves as the scene representation, encoding the objects in the scene and their spatial relationships, and 2) a pre-trained LLM that can be adapted with in-context learning for 3D spatial reasoning.
- Score: 0.9054540533394926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building models that can understand and reason about 3D scenes is difficult owing to the lack of data sources for 3D supervised training and large-scale training regimes. In this work we ask: how can the knowledge in a pre-trained language model be leveraged for 3D scene understanding without any 3D pre-training? The aim of this work is to establish whether pre-trained LLMs possess the priors/knowledge required for reasoning in 3D space, and how we can prompt them so that they can be used for general-purpose spatial reasoning and object understanding in 3D. To this end, we present SceneGPT, an LLM-based scene understanding system that can perform 3D spatial reasoning without training or explicit 3D supervision. The key components of our framework are: 1) a 3D scene graph that serves as the scene representation, encoding the objects in the scene and their spatial relationships, and 2) a pre-trained LLM that can be adapted with in-context learning for 3D spatial reasoning. We evaluate our framework qualitatively on object and scene understanding tasks including object semantics, physical properties, and affordances (object-level) and spatial understanding (scene-level).
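To make the two components concrete, the following minimal sketch flattens a toy scene graph to JSON text and builds a prompt that a frozen, pre-trained LLM could answer via in-context learning. The graph schema, prompt format, and the worked example are illustrative assumptions, not the paper's exact prompt.

```python
import json

# Toy 3D scene graph: objects carry a category, a 3D centre, and an extent;
# edges encode pairwise spatial relationships between object ids.
scene_graph = {
    "objects": [
        {"id": 0, "category": "sofa",  "center": [1.2, 0.0, 0.4], "extent": [2.0, 0.9, 0.8]},
        {"id": 1, "category": "table", "center": [1.1, 1.0, 0.3], "extent": [1.0, 0.6, 0.5]},
        {"id": 2, "category": "lamp",  "center": [2.5, 0.1, 1.5], "extent": [0.3, 0.3, 1.2]},
    ],
    "relations": [
        {"subject": 1, "predicate": "in front of", "object": 0},
        {"subject": 2, "predicate": "next to", "object": 0},
    ],
}

# One worked question/answer pair acts as the in-context example.
IN_CONTEXT_EXAMPLE = (
    'Scene: {"objects": [{"id": 0, "category": "bed", ...}], ...}\n'
    "Q: Which object could a person sleep on?\n"
    "A: Object 0, the bed.\n\n"
)

def build_prompt(graph: dict, question: str) -> str:
    """Flatten the scene graph to JSON and prepend the in-context example."""
    return f"{IN_CONTEXT_EXAMPLE}Scene: {json.dumps(graph)}\nQ: {question}\nA:"

# The resulting string goes to any pre-trained LLM; no 3D training is needed.
print(build_prompt(scene_graph, "Which object is next to the sofa?"))
```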
Related papers
- Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations [37.209795186399326]
Scene-R1, a video-grounded framework, learns to reason about 3D scenes without any point-wise 3D instance supervision.
Scene-R1 can also adapt to the 3D visual question answering task to answer free-form questions directly from video.
arXiv Detail & Related papers (2025-06-21T02:29:10Z)
- Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs [72.11701578308804]
This paper categorizes recent 3D Vision-Language Models into 3D object-centric, 2D image-based, and 3D scene-centric approaches.
Despite their architectural similarity to 2D counterparts, 3D scene-centric VLMs have exhibited lower performance than the latest 3D object-centric and 2D image-based approaches.
Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions.
arXiv Detail & Related papers (2025-06-05T17:56:12Z)
- MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [87.30919771444117]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning.
Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation.
We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
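One generic way to realize such 2D-to-3D transfer (a common recipe, not necessarily this paper's exact pipeline) is to project scene points into posed views, read off the 2D masks produced by the multimodal model, and vote per point. The function name, input conventions, and voting rule below are all illustrative assumptions.

```python
import numpy as np

def lift_2d_masks_to_3d(points, views):
    """Aggregate per-view 2D masks into per-point 3D pseudo-labels by voting.

    points : (N, 3) array of scene points.
    views  : list of dicts, each with a 3x4 projection matrix 'P', an image
             shape tuple 'shape', and a boolean 2D 'mask' from a 2D model.
    """
    votes = np.zeros(len(points))
    seen = np.zeros(len(points))
    homog = np.hstack([points, np.ones((len(points), 1))])  # (N, 4)
    for view in views:
        proj = homog @ view["P"].T                 # (N, 3) homogeneous pixels
        z = proj[:, 2]
        valid = z > 1e-6                           # points in front of camera
        u = np.round(proj[:, 0] / np.where(valid, z, 1)).astype(int)
        v = np.round(proj[:, 1] / np.where(valid, z, 1)).astype(int)
        h, w = view["shape"]
        inside = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        seen[inside] += 1
        votes[inside] += view["mask"][v[inside], u[inside]]
    # A point is labelled positive if most of the views that saw it agreed.
    return (votes > 0.5 * np.maximum(seen, 1)) & (seen > 0)
```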
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
- SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining [100.23919762298227]
We introduce SceneSplat, the first large-scale 3D indoor scene understanding approach that operates on 3DGS.
We also propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes.
SceneSplat-7K is the first large-scale 3DGS dataset for indoor scenes, comprising 6,868 scenes.
arXiv Detail & Related papers (2025-03-23T12:50:25Z)
- 3D Scene Graph Guided Vision-Language Pre-training [11.131667398927394]
3D vision-language (VL) reasoning has gained significant attention due to its potential to bridge the 3D physical world with natural language descriptions.
Existing approaches typically follow task-specific, highly specialized paradigms.
This paper proposes a 3D scene graph-guided vision-language pre-training framework.
arXiv Detail & Related papers (2024-11-27T16:10:44Z)
- Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space [10.49905491984899]
We redefine the problem to segment the 3D volume and propose the following methods for better 3D understanding.
We directly supervise the 3D points to train the language embedding field, unlike previous methods that anchor supervision at 2D pixels.
We transfer the learned language field to 3DGS, achieving the first real-time rendering speed without sacrificing training time or accuracy.
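At query time, such a language field reduces to cosine similarity between per-point (or per-Gaussian) language embeddings and a text embedding from the same vision-language encoder. The sketch below shows this generic open-vocabulary lookup; the function name, threshold value, and random toy features are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def segment_by_text(point_lang_feats, text_embedding, threshold=0.25):
    """Open-vocabulary 3D segmentation by cosine similarity.

    point_lang_feats : (N, D) language embeddings predicted per 3D point,
                       or per Gaussian after transfer to 3DGS.
    text_embedding   : (D,) embedding of the text query from the same
                       vision-language encoder, e.g. CLIP.
    """
    sims = F.cosine_similarity(point_lang_feats, text_embedding[None, :], dim=-1)
    return sims > threshold  # boolean mask over the N points

# Toy usage with random features; real features come from the trained field.
feats = F.normalize(torch.randn(1000, 512), dim=-1)
query = F.normalize(torch.randn(512), dim=-1)
mask = segment_by_text(feats, query)
```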
arXiv Detail & Related papers (2024-08-14T09:50:02Z)
- SceneTeller: Language-to-3D Scene Generation [15.209079637302905]
Given a prompt in natural language describing the object placement in the room, our method produces a high-quality 3D scene corresponding to it.
Our turnkey pipeline produces state-of-the-art 3D scenes, while being easy to use even for novices.
arXiv Detail & Related papers (2024-07-30T10:45:28Z)
- Agent3D-Zero: An Agent for Zero-shot 3D Understanding [79.88440434836673]
Agent3D-Zero is an innovative 3D-aware agent framework for 3D scene understanding.
We propose a novel way to use a Large Visual Language Model (VLM) by actively selecting and analyzing a series of viewpoints for 3D understanding.
A distinctive advantage of Agent3D-Zero is the introduction of novel visual prompts, which significantly unleash the VLMs' ability to identify the most informative viewpoints.
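The active viewpoint-selection loop can be sketched roughly as follows. All interfaces here (`scene.render`, `scene.sample_poses`, `vlm.choose`, `vlm.answer`) are hypothetical stand-ins, not the paper's actual API.

```python
def zero_shot_3d_qa(scene, question, vlm, num_rounds=3, candidates_per_round=8):
    """Iterative viewpoint selection with a 2D VLM (hypothetical interfaces).

    scene.render(pose) -> image, scene.sample_poses(k) -> candidate poses,
    vlm.choose(images, instruction) -> index of the chosen image, and
    vlm.answer(images, question) -> str are stand-ins for illustration.
    """
    selected = []
    for _ in range(num_rounds):
        poses = scene.sample_poses(candidates_per_round)
        images = [scene.render(p) for p in poses]
        # Ask the VLM which candidate view is most informative for the question.
        best = vlm.choose(images, f"Which view best helps answer: {question}?")
        selected.append(images[best])
    # Answer using only the accumulated 2D views; no 3D training is involved.
    return vlm.answer(selected, question)
```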
arXiv Detail & Related papers (2024-03-18T14:47:03Z)
- 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding [12.823274886850697]
We introduce a novel and efficient prompt tuning paradigm, 3DMIT.
This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with the 3D modality information.
We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain.
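Read literally, extending the instruction prompt with 3D information amounts to projecting frozen per-object 3D features directly into the LLM's token-embedding space and concatenating them with the instruction tokens, skipping a separate alignment stage. The sketch below illustrates that idea; the function name, dimensions, and `embed` stub are assumptions, not 3DMIT's actual code.

```python
import torch

def build_3d_instruction_input(object_feats, instruction, embed_text, proj):
    """Concatenate projected 3D object features with instruction embeddings.

    object_feats : (K, D3d) features from a frozen 3D encoder, one per object.
    embed_text   : function mapping a string to (T, Dllm) token embeddings,
                   a stand-in for the LLM's embedding layer.
    proj         : torch.nn.Linear(D3d, Dllm) mapping 3D features straight
                   into the LLM input space, with no separate alignment stage.
    """
    scene_tokens = proj(object_feats)             # (K, Dllm)
    text_tokens = embed_text(instruction)         # (T, Dllm)
    return torch.cat([scene_tokens, text_tokens], dim=0)  # LLM input sequence

# Toy usage with hypothetical dimensions.
proj = torch.nn.Linear(256, 1024)
feats = torch.randn(5, 256)                       # 5 segmented objects
embed = lambda s: torch.randn(len(s.split()), 1024)
seq = build_3d_instruction_input(feats, "Describe the room layout.", embed, proj)
```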
arXiv Detail & Related papers (2024-01-06T12:20:18Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
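One way to realize object identifiers is to interleave a unique token such as <OBJ001> with each object's feature in the LLM input, so that answers can reference objects unambiguously by name. The helper below is an illustrative sketch under assumed interfaces, not the released implementation.

```python
import torch

def interleave_object_identifiers(object_feats, tokenizer_embed, num_objects):
    """Pair each object with a unique identifier token, e.g. <OBJ001>.

    object_feats    : (K, Dllm) object features already projected to the LLM
                      embedding space.
    tokenizer_embed : stand-in function mapping a token string to a (1, Dllm)
                      embedding.
    """
    pieces = []
    for i in range(num_objects):
        ident = f"<OBJ{i:03d}>"
        pieces.append(tokenizer_embed(ident))      # (1, Dllm) identifier token
        pieces.append(object_feats[i : i + 1])     # (1, Dllm) object feature
    return torch.cat(pieces, dim=0)                # interleaved input sequence

feats = torch.randn(3, 1024)                       # projected object features
embed = lambda s: torch.randn(1, 1024)             # hypothetical embedding fn
seq = interleave_object_identifiers(feats, embed, num_objects=3)
# A grounding answer can then name "<OBJ002>" and be mapped back to its object.
```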
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes [56.727745047799246]
3D scene understanding has gained significant attention due to its wide range of applications.
This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs.
arXiv Detail & Related papers (2023-08-17T03:52:15Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
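The language-grounded alignment in this family of methods typically boils down to a symmetric InfoNCE objective between pooled 3D features and frozen CLIP embeddings. The loss below is that generic recipe under assumed shapes, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(scene_feats, clip_feats, temperature=0.07):
    """Symmetric InfoNCE loss aligning 3D features with CLIP targets.

    scene_feats : (B, D) pooled features from a 3D point-cloud encoder.
    clip_feats  : (B, D) matching CLIP image or text embeddings (frozen).
    """
    a = F.normalize(scene_feats, dim=-1)
    b = F.normalize(clip_feats, dim=-1)
    logits = a @ b.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(len(a))                 # diagonal pairs match
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```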
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes [68.61199623705096]
We design a novel 3D pre-training Vision-Language method that helps a model learn semantically meaningful and transferable 3D scene point cloud representations.
We inject the representational power of the popular CLIP model into our 3D encoder by aligning the encoded 3D scene features with the corresponding 2D image and text embeddings.
We evaluate our model's 3D world reasoning capability on the downstream task of 3D Visual Question Answering.
arXiv Detail & Related papers (2023-04-12T16:52:29Z)
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn the transferable 3D point cloud representation in realistic scenarios.
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
- Learning 3D Scene Priors with 2D Supervision [37.79852635415233]
We propose a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth.
Our method represents a 3D scene as a latent vector, from which we can progressively decode a sequence of objects characterized by their class categories.
Experiments on 3D-FRONT and ScanNet show that our method outperforms the state of the art in single-view reconstruction.
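The decode-from-a-latent idea can be sketched as a small autoregressive head that emits one object (class logits plus a 3D box) per step. The module below uses illustrative layer sizes and a fixed step count, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SceneLatentDecoder(nn.Module):
    """Autoregressively decode objects (class + 3D box) from a scene latent.

    A minimal sketch of decoding a sequence of objects from one latent
    vector; dimensions and the stopping criterion are assumptions.
    """
    def __init__(self, latent_dim=256, num_classes=20, max_objects=16):
        super().__init__()
        self.gru = nn.GRUCell(num_classes + 6, latent_dim)
        self.cls_head = nn.Linear(latent_dim, num_classes)
        self.box_head = nn.Linear(latent_dim, 6)   # centre (3) + size (3)
        self.num_classes, self.max_objects = num_classes, max_objects

    def forward(self, z):
        h = z                                      # (B, latent_dim) scene latent
        prev = torch.zeros(z.size(0), self.num_classes + 6)
        objects = []
        for _ in range(self.max_objects):
            h = self.gru(prev, h)                  # condition on previous object
            cls_logits, box = self.cls_head(h), self.box_head(h)
            objects.append((cls_logits, box))
            prev = torch.cat([cls_logits.softmax(-1), box], dim=-1)
        return objects

objs = SceneLatentDecoder()(torch.randn(2, 256))   # decode 2 scene latents
```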
arXiv Detail & Related papers (2022-11-25T15:03:32Z)
- Language-Assisted 3D Feature Learning for Semantic Scene Understanding [26.414294993374543]
Language-assisted 3D feature learning can be combined with modern object detection and instance segmentation methods.
Experiments on several benchmarks of 3D-only and 3D-language tasks demonstrate the effectiveness of our language-assisted 3D feature learning.
arXiv Detail & Related papers (2022-11-25T13:21:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.