OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
- URL: http://arxiv.org/abs/2309.00616v5
- Date: Mon, 12 Aug 2024 16:58:33 GMT
- Title: OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
- Authors: Zhening Huang, Xiaoyang Wu, Xi Chen, Hengshuang Zhao, Lei Zhu, Joan Lasenby
- Abstract summary: OpenIns3D is a new 3D-input-only framework for 3D open-vocabulary scene understanding.
It achieves state-of-the-art performance across a wide range of 3D open-vocabulary tasks.
- Score: 32.508069732371105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we introduce OpenIns3D, a new 3D-input-only framework for 3D open-vocabulary scene understanding. The OpenIns3D framework employs a "Mask-Snap-Lookup" scheme. The "Mask" module learns class-agnostic mask proposals in 3D point clouds, the "Snap" module generates synthetic scene-level images at multiple scales and leverages 2D vision-language models to extract objects of interest, and the "Lookup" module searches through the outcomes of "Snap" to assign category names to the proposed masks. This approach, though simple, achieves state-of-the-art performance across a wide range of 3D open-vocabulary tasks, including recognition, object detection, and instance segmentation, on both indoor and outdoor datasets. Moreover, OpenIns3D facilitates effortless switching between different 2D detectors without requiring retraining. When integrated with powerful 2D open-world models, it achieves excellent results in scene understanding tasks. Furthermore, when combined with LLM-powered 2D models, OpenIns3D exhibits an impressive capability to comprehend and process highly complex text queries that demand intricate reasoning and real-world knowledge. Project page: https://zheninghuang.github.io/OpenIns3D/
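To make the "Lookup" step concrete, below is a minimal, self-contained Python sketch of the mask-labeling vote the abstract describes. The helper names (`box_overlap`, `lookup`) and the exact scoring rule are illustrative assumptions, not the authors' implementation; the mask proposals, synthetic views, and 2D detections are taken as given inputs.

```python
import numpy as np

def box_overlap(pixels, box):
    """Fraction of a mask's projected pixels that fall inside a 2D box."""
    x0, y0, x1, y1 = box
    inside = ((pixels[:, 0] >= x0) & (pixels[:, 0] <= x1) &
              (pixels[:, 1] >= y0) & (pixels[:, 1] <= y1))
    return inside.mean() if len(pixels) else 0.0

def lookup(projected_masks, detections):
    """Assign each 3D mask the 2D label it overlaps most, summed over views.

    projected_masks: {mask_id: {view_id: (N, 2) pixel array}}
    detections:      {view_id: [(box, label, score), ...]}
    """
    labels = {}
    for mask_id, views in projected_masks.items():
        votes = {}
        for view_id, pixels in views.items():
            for box, label, score in detections.get(view_id, []):
                votes[label] = votes.get(label, 0.0) + score * box_overlap(pixels, box)
        labels[mask_id] = max(votes, key=votes.get) if votes else None
    return labels

# Toy usage: one mask projected into one synthetic view, two candidate detections.
masks = {0: {"view0": np.array([[12, 14], [15, 18], [13, 16]])}}
dets = {"view0": [((10, 10, 20, 20), "chair", 0.9), ((40, 40, 60, 60), "table", 0.8)]}
print(lookup(masks, dets))  # {0: 'chair'}
```

On the toy input, the single mask is labeled "chair" because its projected pixels fall entirely inside the chair detection; in the paper's pipeline, such a vote would run over many snapped views at multiple scales.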
Related papers
- EmbodiedSAM: Online Segment Any 3D Thing in Real Time [61.2321497708998]
Embodied tasks require the agent to understand 3D scenes fully as it explores them.
An online, real-time, fine-grained, and highly generalized 3D perception model is urgently needed.
arXiv Detail & Related papers (2024-08-21T17:57:06Z)
- OpenSU3D: Open World 3D Scene Understanding using Foundation Models [2.1262749936758216]
We present a novel, scalable approach for constructing open-set, instance-level 3D scene representations.
Existing methods require pre-constructed 3D scenes and face scalability issues due to per-point feature vector learning.
We evaluate our proposed approach on multiple scenes from ScanNet and Replica datasets demonstrating zero-shot generalization capabilities.
arXiv Detail & Related papers (2024-07-19T13:01:12Z)
- Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation [91.40798599544136]
We propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D.
It effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation.
We find empirically that matching text prompts to 3D masks is both more accurate and faster with a 2D object detector.
arXiv Detail & Related papers (2024-06-04T17:59:31Z)
- OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding [54.981605111365056]
This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting (3DGS) capable of 3D point-level open vocabulary understanding.
Our primary motivation stems from observing that existing 3DGS-based open vocabulary methods mainly focus on 2D pixel-level parsing.
arXiv Detail & Related papers (2024-06-04T07:42:33Z)
- Open-Vocabulary SAM3D: Towards Training-free Open-Vocabulary 3D Scene Understanding [41.96929575241655]
We introduce OV-SAM3D, a training-free method for understanding open-vocabulary 3D scenes.
This framework is designed to perform understanding tasks for any 3D scene without requiring prior knowledge of the scene.
Empirical evaluations on the ScanNet200 and nuScenes datasets demonstrate that our approach surpasses existing open-vocabulary methods in unknown open-world environments.
arXiv Detail & Related papers (2024-05-24T14:07:57Z)
- POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images [32.33170182669095]
We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map from input 2D images.
The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads.
The output is a dense voxel map of 3D grounded language embeddings, enabling a range of open-vocabulary tasks; a schematic skeleton follows below.
arXiv Detail & Related papers (2024-01-17T18:51:53Z)
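The architecture summarized above (a 2D-3D encoder feeding occupancy and 3D-language heads) can be sketched as a PyTorch skeleton. Everything below is an illustrative stand-in, assuming the image-to-voxel lifting has already produced a voxel feature grid; the layer choices and sizes are invented, not POP-3D's actual design.

```python
import torch
import torch.nn as nn

class OccupancyLanguageSketch(nn.Module):
    """Illustrative skeleton: shared 3D encoder with occupancy and language heads."""
    def __init__(self, feat_dim=128, clip_dim=512):
        super().__init__()
        # Stand-in for the 2D-3D encoder that lifts image features to a voxel grid.
        self.encoder = nn.Conv3d(3, feat_dim, kernel_size=3, padding=1)
        # Head 1: per-voxel occupancy logit.
        self.occupancy_head = nn.Conv3d(feat_dim, 1, kernel_size=1)
        # Head 2: per-voxel language embedding in a CLIP-like space.
        self.language_head = nn.Conv3d(feat_dim, clip_dim, kernel_size=1)

    def forward(self, voxel_features):
        f = self.encoder(voxel_features)
        return self.occupancy_head(f), self.language_head(f)

# Toy forward pass over an 8x8x8 voxel grid.
model = OccupancyLanguageSketch()
occ, lang = model(torch.randn(1, 3, 8, 8, 8))
print(occ.shape, lang.shape)  # torch.Size([1, 1, 8, 8, 8]) torch.Size([1, 512, 8, 8, 8])
```

The key design point is the second head: each voxel receives an embedding comparable against CLIP text features, which is what makes the predicted occupancy map open-vocabulary.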
- OpenMask3D: Open-Vocabulary 3D Instance Segmentation [84.58747201179654]
OpenMask3D is a zero-shot approach for open-vocabulary 3D instance segmentation.
Our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings; a minimal fusion sketch follows below.
arXiv Detail & Related papers (2023-06-23T17:36:44Z)
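A minimal sketch of that multi-view fusion, using small NumPy stand-ins in place of real CLIP embeddings (real CLIP features have 512+ dimensions, and the per-view crop extraction is elided):

```python
import numpy as np

def fuse_mask_features(view_embeddings):
    """Average L2-normalized per-view image embeddings into one mask feature."""
    v = np.asarray(view_embeddings, dtype=np.float64)
    v /= np.linalg.norm(v, axis=1, keepdims=True)     # normalize each view
    fused = v.mean(axis=0)
    return fused / np.linalg.norm(fused)              # renormalize the mean

def classify(mask_feature, text_embeddings, labels):
    """Pick the label whose normalized text embedding is most similar."""
    t = np.asarray(text_embeddings, dtype=np.float64)
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    return labels[int(np.argmax(t @ mask_feature))]

# Toy example: crops of one mask seen in two views, two candidate labels.
views = [[0.9, 0.1, 0.0, 0.0], [0.8, 0.2, 0.1, 0.0]]
texts = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
print(classify(fuse_mask_features(views), texts, ["chair", "plant"]))  # chair
```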
- Segment Anything in 3D with Radiance Fields [83.14130158502493]
This paper generalizes the Segment Anything Model (SAM) to segment 3D objects.
We refer to the proposed solution as SA3D, short for Segment Anything in 3D.
We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within seconds.
arXiv Detail & Related papers (2023-04-24T17:57:15Z)
- OpenScene: 3D Scene Understanding with Open Vocabularies [73.1411930820683]
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision.
We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space.
This zero-shot approach enables task-agnostic training and open-vocabulary queries; a minimal query sketch follows below.
arXiv Detail & Related papers (2022-11-28T18:58:36Z)
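Because the point features are co-embedded with CLIP, an open-vocabulary query reduces to cosine similarity against text embeddings. A minimal sketch with toy stand-in features (in practice, the per-point features come from the trained model and the text embeddings from a CLIP text encoder):

```python
import numpy as np

def open_vocab_segment(point_features, text_embeddings, labels):
    """Label every 3D point with its nearest text embedding in CLIP space.

    point_features:  (N, D) per-point features co-embedded with CLIP.
    text_embeddings: (L, D) text embeddings of the query vocabulary.
    """
    p = point_features / np.linalg.norm(point_features, axis=1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    sims = p @ t.T                    # (N, L) cosine similarities
    return [labels[i] for i in sims.argmax(axis=1)]

# Toy scene: three points, a two-word vocabulary, 3-D stand-in features.
pts = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.8, 0.2, 0.1]])
txt = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(open_vocab_segment(pts, txt, ["floor", "sofa"]))  # ['floor', 'sofa', 'floor']
```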
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.