Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding
- URL: http://arxiv.org/abs/2308.00353v1
- Date: Tue, 1 Aug 2023 07:50:14 GMT
- Title: Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding
- Authors: Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, Xiaojuan Qi
- Abstract summary: Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
- Score: 57.47315482494805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-world instance-level scene understanding aims to locate and recognize
unseen object categories that are not present in the annotated dataset. This
task is challenging because the model needs to both localize novel 3D objects
and infer their semantic categories. A key factor for the recent progress in 2D
open-world perception is the availability of large-scale image-text pairs from
the Internet, which cover a wide range of vocabulary concepts. However, this
success is hard to replicate in 3D scenarios due to the scarcity of 3D-text
pairs. To address this challenge, we propose to harness pre-trained
vision-language (VL) foundation models that encode extensive knowledge from
image-text pairs to generate captions for multi-view images of 3D scenes. This
allows us to establish explicit associations between 3D shapes and
semantic-rich captions. Moreover, to enhance the fine-grained visual-semantic
representation learning from captions for object-level categorization, we
design hierarchical point-caption association methods to learn semantic-aware
embeddings that exploit the 3D geometry between 3D points and multi-view
images. In addition, to tackle the localization challenge for novel classes in
the open-world setting, we develop debiased instance localization, which
involves training object grouping modules on unlabeled data using
instance-level pseudo supervision. This significantly improves the
generalization capabilities of instance grouping and thus the ability to
accurately locate novel objects. We conduct extensive experiments on 3D
semantic, instance, and panoptic segmentation tasks, covering indoor and
outdoor scenes across three datasets. Our method outperforms baseline methods
by a significant margin in semantic segmentation (e.g. 34.5% to 65.3%),
instance segmentation (e.g. 21.8% to 54.0%) and panoptic segmentation (e.g.
14.7% to 43.3%). Code will be available.
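Below is a minimal sketch of how the point-caption association and contrastive learning described in the abstract could be wired up. It assumes captions for the multi-view images have already been produced by an off-the-shelf VL captioner and encoded with a frozen text encoder; the function names, the single association granularity, and the average-pooling choice are illustrative assumptions, not the authors' released implementation.
```python
import torch
import torch.nn.functional as F


def project_points_to_view(points_xyz, intrinsics, extrinsics, image_hw):
    """Return a boolean mask of the 3D points visible in one camera view.

    points_xyz: (N, 3) scene points; intrinsics: (3, 3); extrinsics: (4, 4)
    world-to-camera transform; image_hw: (H, W). Visibility here is just a
    depth-positive and in-bounds check (no occlusion test in this sketch).
    """
    n = points_xyz.shape[0]
    ones = torch.ones(n, 1, device=points_xyz.device)
    cam = (extrinsics @ torch.cat([points_xyz, ones], dim=1).T).T[:, :3]  # (N, 3)
    in_front = cam[:, 2] > 1e-6
    pix = (intrinsics @ cam.T).T                                          # (N, 3)
    uv = pix[:, :2] / pix[:, 2:].clamp(min=1e-6)
    h, w = image_hw
    in_bounds = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return in_front & in_bounds


def point_caption_contrastive_loss(point_feats, caption_embeds, assoc_masks, tau=0.07):
    """Contrast pooled point features against their captions' text embeddings.

    point_feats: (N, C) per-point embeddings from the 3D backbone
    caption_embeds: (K, C) frozen text-encoder embeddings of K captions
    assoc_masks: (K, N) boolean masks linking each caption to its visible points
    Assumes every caption has at least one associated point.
    """
    pooled = torch.stack([point_feats[m].mean(dim=0) for m in assoc_masks])  # (K, C)
    pooled = F.normalize(pooled, dim=-1)
    text = F.normalize(caption_embeds, dim=-1)
    logits = pooled @ text.T / tau            # (K, K) similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, targets)   # match each point set to its caption
```
The hierarchical point-caption association in the paper would correspond to building `assoc_masks` at different granularities, for example all points in a scene, the points projecting into one view, or the points inside a single entity's image region.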
Related papers
- Search3D: Hierarchical Open-Vocabulary 3D Segmentation [78.47704793095669]
Open-vocabulary 3D segmentation enables the exploration of 3D spaces using free-form text descriptions.
We introduce Search3D, an approach that builds a hierarchical open-vocabulary 3D scene representation.
Our method aims to expand the capabilities of open-vocabulary instance-level 3D segmentation by shifting towards a more flexible open-vocabulary 3D search setting.
arXiv Detail & Related papers (2024-09-27T03:44:07Z)
- 3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance [68.8825501902835]
3DSS-VLG is a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance.
To the best of our knowledge, this is the first work to investigate 3D weakly supervised semantic segmentation by using the textual semantic information of text category labels.
arXiv Detail & Related papers (2024-07-13T09:39:11Z)
- Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
Per-task instruction-following templates are employed to ensure naturalness and diversity when translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
- UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation [46.998093729036334]
We propose a unified multimodal 3D open-vocabulary scene understanding network, namely UniM-OV3D.
To better integrate global and local features of the point clouds, we design a hierarchical point cloud feature extraction module.
To facilitate the learning of coarse-to-fine point-semantic representations from captions, we propose the utilization of hierarchical 3D caption pairs.
arXiv Detail & Related papers (2024-01-21T04:13:58Z)
- POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images [32.33170182669095]
We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map from input 2D images.
The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads.
The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks.
arXiv Detail & Related papers (2024-01-17T18:51:53Z)
- RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding [46.253711788685536]
We introduce a 3D-aware SFusion strategy that fuses 3D vision-language pairs derived from multiple 2D foundation models.
We devise a region-aware point-discriminative contrastive learning objective to enable robust and effective 3D learning.
Our model outperforms prior 3D open-world scene understanding approaches by an average of 17.2% and 9.1% for semantic and instance segmentation.
arXiv Detail & Related papers (2023-04-03T13:30:04Z)
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn a transferable 3D point cloud representation in realistic scenarios.
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
- PLA: Language-Driven Open-Vocabulary 3D Scene Understanding [57.47315482494805]
Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space.
Recent breakthrough of 2D open-vocabulary perception is driven by Internet-scale paired image-text data with rich vocabulary concepts.
We propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D (a sketch of the open-vocabulary classification step these methods share follows this list).
arXiv Detail & Related papers (2022-11-29T15:52:22Z)
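For context on the open-vocabulary classification step that Lowis3D and the related works above (e.g. PLA, RegionPLC, UniM-OV3D) rely on at inference time, here is a minimal, hedged sketch: once per-point embeddings are aligned with a text encoder's space, any user-supplied category list can be classified by cosine similarity to prompt embeddings. The prompt template and the `text_encoder` callable are illustrative assumptions, not details taken from any of the papers above.
```python
import torch
import torch.nn.functional as F


def open_vocabulary_labels(point_feats, class_names, text_encoder, tau=0.01):
    """Assign each point the category whose text embedding is most similar.

    point_feats: (N, C) point embeddings aligned with the text space in training
    class_names: arbitrary category strings, e.g. ["sofa", "piano", "monitor"]
    text_encoder: callable mapping a list of prompts to (K, C) embeddings
    (e.g. a frozen CLIP text tower); tau is a softmax temperature.
    """
    prompts = [f"a {name} in a scene" for name in class_names]  # assumed template
    text_embeds = F.normalize(text_encoder(prompts), dim=-1)    # (K, C)
    point_embeds = F.normalize(point_feats, dim=-1)             # (N, C)
    logits = point_embeds @ text_embeds.T / tau                 # (N, K)
    return logits.argmax(dim=-1)                                # (N,) predicted ids
```
Because `class_names` is only consumed at inference, unseen categories can be queried without retraining, which is what the open-vocabulary (open-world) setting in these papers refers to.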