Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation
- URL: http://arxiv.org/abs/2406.02548v2
- Date: Thu, 20 Jun 2024 09:29:26 GMT
- Title: Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation
- Authors: Mohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan,
- Abstract summary: We propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D.
It effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation.
We empirically find that a better performance of matching text prompts to 3D masks can be achieved in a faster fashion with a 2D object detector.
- Score: 91.40798599544136
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D clip features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only result in redundancy that unnecessarily increases the inference time. We empirically find that a better performance of matching text prompts to 3D masks can be achieved in a faster fashion with a 2D object detector. We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to $\sim$16$\times$ speedup compared to the best existing method in literature. On ScanNet200 val. set, our Open-YOLO 3D achieves mean average precision (mAP) of 24.7\% while operating at 22 seconds per scene. Code and model are available at github.com/aminebdj/OpenYOLO3D.
Related papers
- EmbodiedSAM: Online Segment Any 3D Thing in Real Time [61.2321497708998]
Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration.
An online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed.
arXiv Detail & Related papers (2024-08-21T17:57:06Z) - DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z) - Unified Scene Representation and Reconstruction for 3D Large Language Models [40.693839066536505]
Existing approaches extract point clouds either from ground truth (GT) geometry or 3D scenes reconstructed by auxiliary models.
We introduce Uni3DR2 extracts 3D geometric and semantic aware representation features via the frozen 2D foundation models.
Our learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs.
arXiv Detail & Related papers (2024-04-19T17:58:04Z) - OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D
Data [15.53270401654078]
OVIR-3D is a method for open-vocabulary 3D object instance retrieval without using any 3D data for training.
It is achieved by a multi-view fusion of text-aligned 2D region proposals into 3D space.
Experiments on public datasets and a real robot show the effectiveness of the method and its potential for applications in robot navigation and manipulation.
arXiv Detail & Related papers (2023-11-06T05:00:00Z) - Fully Sparse Fusion for 3D Object Detection [69.32694845027927]
Currently prevalent multimodal 3D detection methods are built upon LiDAR-based detectors that usually use dense Bird's-Eye-View feature maps.
Fully sparse architecture is gaining attention as they are highly efficient in long-range perception.
In this paper, we study how to effectively leverage image modality in the emerging fully sparse architecture.
arXiv Detail & Related papers (2023-04-24T17:57:43Z) - Segment Anything in 3D with Radiance Fields [83.14130158502493]
This paper generalizes the Segment Anything Model (SAM) to segment 3D objects.
We refer to the proposed solution as SA3D, short for Segment Anything in 3D.
We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within seconds.
arXiv Detail & Related papers (2023-04-24T17:57:15Z) - TR3D: Towards Real-Time Indoor 3D Object Detection [6.215404942415161]
TR3D is a fully-convolutional 3D object detection model trained end-to-end.
To take advantage of both point cloud and RGB inputs, we introduce an early fusion of 2D and 3D features.
Our model with early feature fusion, which we refer to as TR3D+FF, outperforms existing 3D object detection approaches on the SUN RGB-D dataset.
arXiv Detail & Related papers (2023-02-06T15:25:50Z) - CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z) - Multi-Modality Task Cascade for 3D Object Detection [22.131228757850373]
Many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data.
We propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions.
We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance.
arXiv Detail & Related papers (2021-07-08T17:55:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.