PLA: Language-Driven Open-Vocabulary 3D Scene Understanding
- URL: http://arxiv.org/abs/2211.16312v2
- Date: Wed, 22 Mar 2023 05:17:01 GMT
- Title: PLA: Language-Driven Open-Vocabulary 3D Scene Understanding
- Authors: Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, Xiaojuan Qi
- Abstract summary: Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space.
The recent breakthrough in 2D open-vocabulary perception is driven by Internet-scale paired image-text data with rich vocabulary concepts.
We propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D.
- Score: 57.47315482494805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary scene understanding aims to localize and recognize unseen
categories beyond the annotated label space. The recent breakthrough of 2D
open-vocabulary perception is largely driven by Internet-scale paired
image-text data with rich vocabulary concepts. However, this success cannot be
directly transferred to 3D scenarios due to the inaccessibility of large-scale
3D-text pairs. To this end, we propose to distill knowledge encoded in
pre-trained vision-language (VL) foundation models through captioning
multi-view images from 3D, which allows explicitly associating 3D and
semantic-rich captions. Further, to foster coarse-to-fine visual-semantic
representation learning from captions, we design hierarchical 3D-caption pairs,
leveraging geometric constraints between 3D scenes and multi-view images.
Finally, by employing contrastive learning, the model learns language-aware
embeddings that connect 3D and text for open-vocabulary tasks. Our method not
only remarkably outperforms baseline methods by 25.8% $\sim$ 44.7% hIoU and
14.5% $\sim$ 50.4% hAP$_{50}$ in open-vocabulary semantic and instance
segmentation, but also shows robust transferability on challenging zero-shot
domain transfer tasks. See the project website at
https://dingry.github.io/projects/PLA.
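As a rough illustration of the recipe described in the abstract above (captioning multi-view images to obtain 3D-caption pairs, then contrastively aligning 3D features with caption embeddings), the sketch below shows a generic InfoNCE-style point-caption loss. It is a minimal sketch, not the authors' implementation; the pooled 3D features, the frozen text encoder, and the temperature value are all assumptions.

```python
# Minimal sketch (not PLA's code): contrastive alignment between pooled 3D
# features and caption embeddings produced by a frozen text encoder.
import torch
import torch.nn.functional as F

def point_caption_contrastive_loss(point_feats, caption_feats, temperature=0.07):
    """InfoNCE over a batch of matched (3D region, caption) pairs.

    point_feats:   (B, D) pooled features of 3D regions (scene/view/entity level)
    caption_feats: (B, D) caption embeddings from a frozen text encoder
    """
    p = F.normalize(point_feats, dim=-1)
    c = F.normalize(caption_feats, dim=-1)
    logits = p @ c.T / temperature                      # (B, B) similarities
    targets = torch.arange(p.size(0), device=p.device)
    # Symmetric cross-entropy: 3D-to-text and text-to-3D directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

At inference, open-vocabulary prediction then amounts to comparing per-point embeddings against text embeddings of arbitrary category names and taking the most similar one.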
Related papers
- Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
- UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation [46.998093729036334]
We propose a unified multimodal 3D open-vocabulary scene understanding network, namely UniM-OV3D.
To better integrate global and local features of the point clouds, we design a hierarchical point cloud feature extraction module.
To facilitate the learning of coarse-to-fine point-semantic representations from captions, we propose the utilization of hierarchical 3D caption pairs.
arXiv Detail & Related papers (2024-01-21T04:13:58Z)
- POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images [32.33170182669095]
We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map from input 2D images.
The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads.
The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks.
arXiv Detail & Related papers (2024-01-17T18:51:53Z)
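To illustrate how a dense voxel map of 3D-grounded language embeddings, as in POP-3D above, supports open-vocabulary queries, the hypothetical snippet below labels every voxel by cosine similarity against text embeddings of free-form class prompts. The shapes and the text encoder are assumptions, not POP-3D's interface.

```python
# Hypothetical sketch: zero-shot labeling of a voxel grid whose cells store
# language-aligned embeddings (shapes and encoders are assumptions).
import torch
import torch.nn.functional as F

def label_voxels(voxel_embed, class_text_embed):
    """voxel_embed:      (X, Y, Z, D) language-aligned voxel features
    class_text_embed: (C, D) embeddings of free-form class prompts
    Returns an (X, Y, Z) tensor of predicted class indices."""
    v = F.normalize(voxel_embed, dim=-1)
    t = F.normalize(class_text_embed, dim=-1)
    sim = torch.einsum("xyzd,cd->xyzc", v, t)  # per-voxel cosine similarity
    return sim.argmax(dim=-1)
```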
- SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding [37.47195477043883]
3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents.
We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes.
We demonstrate this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS) for 3D vision-language learning.
arXiv Detail & Related papers (2024-01-17T17:04:35Z)
- Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training [51.632418297156605]
We introduce MixCon3D, a method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training.
We develop the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images together with the point cloud.
Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment.
arXiv Detail & Related papers (2023-11-03T06:05:36Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn a transferable 3D point cloud representation in realistic scenarios.
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
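The instance-level text-image-point alignment described in the CLIP$^2$ summary above can be pictured as pairwise contrastive losses over matched triplets. The sketch below is a generic illustration with assumed encoders and batch layout, not CLIP$^2$'s training code.

```python
# Hypothetical sketch of tri-modal (text, image, point) contrastive alignment
# over matched instance triplets; encoders and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two (B, D) batches of matched embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def trimodal_loss(text_feats, image_feats, point_feats):
    """Align each pair of modalities for a batch of text-image-point triplets."""
    return (info_nce(point_feats, text_feats) +
            info_nce(point_feats, image_feats) +
            info_nce(image_feats, text_feats))
```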
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.