OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data
- URL: http://arxiv.org/abs/2311.02873v1
- Date: Mon, 6 Nov 2023 05:00:00 GMT
- Title: OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data
- Authors: Shiyang Lu, Haonan Chang, Eric Pu Jing, Abdeslam Boularias, Kostas Bekris
- Abstract summary: OVIR-3D is a method for open-vocabulary 3D object instance retrieval without using any 3D data for training.
It is achieved by a multi-view fusion of text-aligned 2D region proposals into 3D space.
Experiments on public datasets and a real robot show the effectiveness of the method and its potential for applications in robot navigation and manipulation.
- Score: 15.53270401654078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work presents OVIR-3D, a straightforward yet effective method for
open-vocabulary 3D object instance retrieval without using any 3D data for
training. Given a language query, the proposed method is able to return a
ranked set of 3D object instance segments based on the feature similarity of
the instance and the text query. This is achieved by a multi-view fusion of
text-aligned 2D region proposals into 3D space, where the 2D region proposal
network could leverage 2D datasets, which are more accessible and typically
larger than 3D datasets. The proposed fusion process is efficient as it can be
performed in real-time for most indoor 3D scenes and does not require
additional training in 3D space. Experiments on public datasets and a real
robot show the effectiveness of the method and its potential for applications
in robot navigation and manipulation.
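The retrieval step described in the abstract can be sketched in a few lines: each fused 3D instance carries a text-aligned feature (e.g. an average of the 2D region-proposal features that projected onto it), and a language query embedded in the same space ranks instances by cosine similarity. This is a minimal illustrative sketch, not the authors' implementation; the feature vectors, their dimension, and the `fuse_views`/`retrieve` helpers are stand-ins (the actual method uses CLIP-space features from a 2D region proposal network).

```python
# Sketch of OVIR-3D-style open-vocabulary retrieval (illustrative only).
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def fuse_views(view_features):
    """Fuse the per-view 2D region features of one 3D instance by averaging."""
    dim = len(view_features[0])
    return [sum(f[i] for f in view_features) / len(view_features)
            for i in range(dim)]

def retrieve(instances, query_embedding, top_k=3):
    """Rank 3D instance segments by feature similarity to the text query."""
    scored = [(inst_id, cosine(feat, query_embedding))
              for inst_id, feat in instances.items()]
    scored.sort(key=lambda p: p[1], reverse=True)
    return scored[:top_k]

# Toy scene: three instances with 4-D stand-in features fused from 2D views.
instances = {
    "chair": fuse_views([[0.9, 0.1, 0.0, 0.0], [0.8, 0.2, 0.0, 0.1]]),
    "table": fuse_views([[0.1, 0.9, 0.1, 0.0]]),
    "sofa":  fuse_views([[0.7, 0.3, 0.1, 0.0]]),
}
query = [1.0, 0.0, 0.0, 0.0]  # stand-in for an embedded query such as "chair"
print(retrieve(instances, query, top_k=2))
```

In the real system the fusion happens over 3D points reconstructed from depth and camera poses, but the ranking itself reduces to this similarity search over fused instance features.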
Related papers
- ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images [47.682942867405224]
ConDense is a framework for 3D pre-training utilizing existing 2D networks and large-scale multi-view datasets.
We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline.
arXiv Detail & Related papers (2024-08-30T05:57:01Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is trained directly on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation [91.40798599544136]
We propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D.
It effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation.
We empirically find that text prompts can be matched to 3D masks both faster and more accurately with a 2D object detector.
arXiv Detail & Related papers (2024-06-04T17:59:31Z)
- POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images [32.33170182669095]
We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map from input 2D images.
The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads.
The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks.
arXiv Detail & Related papers (2024-01-17T18:51:53Z)
- Uni3D: Exploring Unified 3D Representation at Scale [66.26710717073372]
We present Uni3D, a 3D foundation model to explore the unified 3D representation at scale.
Uni3D uses a 2D ViT, pretrained end-to-end, to align 3D point cloud features with image-text aligned features.
We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
arXiv Detail & Related papers (2023-10-10T16:49:21Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that leverages 3D features extracted from a large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during the training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
- Cylinder3D: An Effective 3D Framework for Driving-scene LiDAR Semantic Segmentation [87.54570024320354]
State-of-the-art methods for large-scale driving-scene LiDAR semantic segmentation often project and process the point clouds in the 2D space.
A straightforward solution to tackle the issue of 3D-to-2D projection is to keep the 3D representation and process the points in the 3D space.
We develop a 3D cylinder partition and a 3D cylinder convolution based framework, termed as Cylinder3D, which exploits the 3D topology relations and structures of driving-scene point clouds.
arXiv Detail & Related papers (2020-08-04T13:56:19Z)
- Parameter-Efficient Person Re-identification in the 3D Space [51.092669618679615]
We project 2D images to a 3D space and introduce a novel parameter-efficient Omni-scale Graph Network (OG-Net) to learn the pedestrian representation directly from 3D point clouds.
OG-Net effectively exploits the local information provided by sparse 3D points and takes advantage of the structure and appearance information in a coherent manner.
This is among the first attempts to conduct person re-identification in 3D space.
arXiv Detail & Related papers (2020-06-08T13:20:33Z)
- One Point, One Object: Simultaneous 3D Object Segmentation and 6-DOF Pose Estimation [0.7252027234425334]
We propose a method for simultaneous 3D object segmentation and 6-DOF pose estimation in pure 3D point cloud scenes.
The key component of our method is a multi-task CNN architecture that predicts the 3D object segmentation and the 6-DOF pose simultaneously.
For experimental evaluation, we generate expanded training data for two state-of-the-art 3D object datasets using Augmented Reality (AR).
arXiv Detail & Related papers (2019-12-27T13:48:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.