Language-Assisted 3D Feature Learning for Semantic Scene Understanding
- URL: http://arxiv.org/abs/2211.14091v1
- Date: Fri, 25 Nov 2022 13:21:59 GMT
- Title: Language-Assisted 3D Feature Learning for Semantic Scene Understanding
- Authors: Junbo Zhang, Guofan Fan, Guanghan Wang, Zhengyuan Su, Kaisheng Ma, Li
Yi
- Abstract summary: Language-assisted 3D feature learning can be combined with modern object detection and instance segmentation methods.
Experiments on several benchmarks of 3D-only and 3D-language tasks demonstrate the effectiveness of our language-assisted 3D feature learning.
- Score: 26.414294993374543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning descriptive 3D features is crucial for understanding 3D scenes with
diverse objects and complex structures. However, it is usually unknown whether
important geometric attributes and scene context obtain enough emphasis in an
end-to-end trained 3D scene understanding network. To guide 3D feature learning
toward important geometric attributes and scene context, we explore the help of
textual scene descriptions. Given some free-form descriptions paired with 3D
scenes, we extract the knowledge regarding the object relationships and object
attributes. We then inject the knowledge to 3D feature learning through three
classification-based auxiliary tasks. This language-assisted training can be
combined with modern object detection and instance segmentation methods to
promote 3D semantic scene understanding, especially in a label-deficient
regime. Moreover, the 3D feature learned with language assistance is better
aligned with the language features, which can benefit various 3D-language
multimodal tasks. Experiments on several benchmarks of 3D-only and 3D-language
tasks demonstrate the effectiveness of our language-assisted 3D feature
learning. Code is available at
https://github.com/Asterisci/Language-Assisted-3D.
Related papers
- Functionality understanding and segmentation in 3D scenes [6.1744362771344]
We introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes.
Fun3DU uses a language model to parse the task description through Chain-of-Thought reasoning.
We evaluate Fun3DU on SceneFun3D, the most recent and only dataset to benchmark this task.
arXiv Detail & Related papers (2024-11-25T11:57:48Z) - Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
Per-task instruction-following templates are employed to ensure natural and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z) - Agent3D-Zero: An Agent for Zero-shot 3D Understanding [79.88440434836673]
Agent3D-Zero is an innovative 3D-aware agent framework addressing the 3D scene understanding.
We propose a novel way to make use of a Large Visual Language Model (VLM) via actively selecting and analyzing a series of viewpoints for 3D understanding.
A distinctive advantage of Agent3D-Zero is the introduction of novel visual prompts, which significantly unleash the VLMs' ability to identify the most informative viewpoints.
arXiv Detail & Related papers (2024-03-18T14:47:03Z) - SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding [37.47195477043883]
3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents.
We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes.
We demonstrate this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS) for 3D vision-language learning.
arXiv Detail & Related papers (2024-01-17T17:04:35Z) - 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding [12.823274886850697]
We introduce a novel and efficient prompt tuning paradigm, 3DMIT.
This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with the 3D modality information.
We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain.
arXiv Detail & Related papers (2024-01-06T12:20:18Z) - Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z) - Chat-3D: Data-efficiently Tuning Large Language Model for Universal
Dialogue of 3D Scenes [56.727745047799246]
3D scene understanding has gained significant attention due to its wide range of applications.
This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs.
arXiv Detail & Related papers (2023-08-17T03:52:15Z) - Lowis3D: Language-Driven Open-World Instance-Level 3D Scene
Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z) - Multi-CLIP: Contrastive Vision-Language Pre-training for Question
Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.