Related papers: 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

URL: http://arxiv.org/abs/2412.18450v2
Date: Wed, 25 Dec 2024 11:13:41 GMT
Title: 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
Authors: Tatiana Zemskova, Dmitry Yudin
Abstract summary: A 3D scene graph represents a compact scene model, storing information about the objects and the semantic relationships between them. In this work, we propose a method 3DGraphLLM for constructing a learnable representation of a 3D scene graph. The learnable representation is used as input for LLMs to perform 3D vision-language tasks.
Score: 0.5755004576310334
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A 3D scene graph represents a compact scene model, storing information about the objects and the semantic relationships between them, making its use promising for robotic tasks. When interacting with a user, an embodied intelligent agent should be capable of responding to various queries about the scene formulated in natural language. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for creating learnable representations of 3D scenes have demonstrated the potential to improve the quality of LLM responses by adapting to the 3D world. However, the existing methods do not explicitly utilize information about the semantic relationships between objects, limiting themselves to information about their coordinates. In this work, we propose a method, 3DGraphLLM, for constructing a learnable representation of a 3D scene graph. The learnable representation is used as input for LLMs to perform 3D vision-language tasks. In our experiments on the popular ScanRefer, RIORefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap datasets, we demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects. The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.

Related papers

Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space [58.623106094568776]
3D morphable models (3DMMs) are a powerful tool to represent the possible shapes and appearances of an object category. We introduce a new method, Common3D, that learns 3DMMs of common objects in a fully self-supervised manner from a collection of object-centric videos. Common3D is the first completely self-supervised method that can solve various vision tasks in a zero-shot manner.
arXiv Detail & Related papers (2025-04-30T15:42:23Z)
Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces [113.91791599146786]
We introduce the task of predicting functional 3D scene graphs for real-world indoor environments from posed RGB-D images. Unlike traditional 3D scene graphs that focus on spatial relationships of objects, functional 3D scene graphs capture objects, interactive elements, and their functional relationships. We evaluate our approach on an extended SceneFun3D dataset and a newly collected dataset, FunGraph3D, both annotated with functional 3D scene graphs.
arXiv Detail & Related papers (2025-03-24T22:53:19Z)
Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph [0.3926357402982764]
We propose a modular approach called BBQ that constructs a 3D scene graph representation with metric and semantic edges. BBQ employs robust DINO-powered associations to construct a 3D object-centric map. We show that BBQ takes a leading place in open-vocabulary 3D semantic segmentation compared to other zero-shot methods.
arXiv Detail & Related papers (2024-06-11T09:57:04Z)
Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes. Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
Transcrib3D: 3D Referring Expression Resolution through Large Language Models [28.121606686759225]
We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models. Transcrib3D achieves state-of-the-art results on 3D reference resolution benchmarks. We show that our method enables a real robot to perform pick-and-place tasks given queries that contain challenging referring expressions.
arXiv Detail & Related papers (2024-04-30T02:48:20Z)
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level. Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning [125.90002884194838]
ConceptGraphs is an open-vocabulary graph-structured representation for 3D scenes. It is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association. We demonstrate the utility of this representation through a number of downstream planning tasks.
arXiv Detail & Related papers (2023-09-28T17:53:38Z)
CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn the transferable 3D point cloud representation in realistic scenarios. Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
OpenScene: 3D Scene Understanding with Open Vocabularies [73.1411930820683]
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries.
arXiv Detail & Related papers (2022-11-28T18:58:36Z)
Interactive Annotation of 3D Object Geometry using 2D Scribbles [84.51514043814066]
In this paper, we propose an interactive framework for annotating 3D object geometry from point cloud data and RGB imagery. Our framework targets naive users without artistic or graphics expertise.
arXiv Detail & Related papers (2020-08-24T21:51:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.