FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for
Open-Vocabulary 3D Detection
- URL: http://arxiv.org/abs/2312.14465v1
- Date: Fri, 22 Dec 2023 06:34:23 GMT
- Title: FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for
Open-Vocabulary 3D Detection
- Authors: Dongmei Zhang, Chang Li, Ray Zhang, Shenghao Xie, Wei Xue, Xiaodong
Xie, Shanghang Zhang
- Abstract summary: FM-OV3D is a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection.
We leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion models, and cross-modal discriminative models like CLIP.
Experiments show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model.
- Score: 40.965892255504144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The superior performance of pre-trained foundation models in various visual
tasks underscores their potential to enhance 2D models' open-vocabulary
ability. Existing methods explore analogous applications in the 3D space.
However, most of them center only on knowledge extraction from a single
foundation model, which limits the open-vocabulary ability of 3D models. We
hypothesize that leveraging complementary pre-trained knowledge from various
foundation models can improve knowledge transfer from 2D pre-trained visual
language models to the 3D space. In this work, we propose FM-OV3D, a method of
Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D
Detection, which improves the open-vocabulary localization and recognition
abilities of the 3D model by blending knowledge from multiple pre-trained
foundation models, achieving true open-vocabulary detection without facing
constraints from the original 3D datasets. Specifically, to learn the
open-vocabulary 3D
localization ability, we adopt the open-vocabulary localization knowledge of
the Grounded-Segment-Anything model. For open-vocabulary 3D recognition
ability, we leverage the knowledge of generative foundation models, including
GPT-3 and Stable Diffusion models, and cross-modal discriminative models like
CLIP. The experimental results on two popular benchmarks for open-vocabulary 3D
object detection show that our model efficiently learns knowledge from multiple
foundation models to enhance the open-vocabulary ability of the 3D model and
successfully achieves state-of-the-art performance in open-vocabulary 3D object
detection tasks. Code is released at
https://github.com/dmzhang0425/FM-OV3D.git.
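As a rough illustration of the recognition side of this cross-modal blending, the sketch below builds an open-vocabulary classifier from CLIP text embeddings (which, in the spirit of the abstract, could be enriched with GPT-3-generated class descriptions) and scores 3D proposal features against it after projecting them into CLIP's embedding space. This is a minimal sketch, not the released implementation: the projection head, feature dimension, and prompts are illustrative assumptions; see the repository above for the actual method.
```python
# Minimal sketch (not the authors' code): open-vocabulary recognition for 3D
# proposals via alignment with CLIP text embeddings. The 3D backbone, the
# 256-d proposal features, and the projection head are hypothetical.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

# Open-vocabulary class prompts; FM-OV3D-style pipelines could expand these
# with GPT-3-generated descriptions (hypothetical example prompts here).
class_prompts = ["a photo of a chair", "a photo of a sofa", "a photo of a bed"]

with torch.no_grad():
    tokens = clip.tokenize(class_prompts).to(device)
    text_feats = clip_model.encode_text(tokens).float()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)  # (C, 512)

# Hypothetical head mapping 3D proposal features (e.g. from a point-cloud
# detector backbone) into the CLIP embedding space.
proj_head = nn.Linear(256, text_feats.shape[-1]).to(device)

def classify_proposals(proposal_feats: torch.Tensor) -> torch.Tensor:
    """Score (N, 256) 3D proposal features against the open-vocabulary classes."""
    vis_feats = proj_head(proposal_feats.to(device))
    vis_feats = vis_feats / vis_feats.norm(dim=-1, keepdim=True)
    # Cosine-similarity logits scaled by CLIP's learned temperature.
    return clip_model.logit_scale.exp() * vis_feats @ text_feats.t()

# Example: three random proposals scored against the class prompts.
logits = classify_proposals(torch.randn(3, 256))
print(logits.softmax(dim=-1))
```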
Related papers
- Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation [67.36775428466045]
We propose Geometry Guided Self-Distillation (GGSD) to learn superior 3D representations from 2D pre-trained models.
Due to the advantages of 3D representation, the performance of the distilled 3D student model can significantly surpass that of the 2D teacher model.
arXiv Detail & Related papers (2024-07-18T10:13:56Z)
- Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image [70.02187124865627]
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene.
We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes.
We demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection.
arXiv Detail & Related papers (2024-07-07T04:50:04Z)
- OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation [67.56268991234371]
OV-Uni3DETR achieves state-of-the-art performance across various scenarios, surpassing existing methods by more than 6% on average.
Code and pre-trained models will be released later.
arXiv Detail & Related papers (2024-03-28T17:05:04Z)
- POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images [32.33170182669095]
We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map from input 2D images.
The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads.
The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks.
arXiv Detail & Related papers (2024-01-17T18:51:53Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Weakly Supervised 3D Open-vocabulary Segmentation [104.07740741126119]
We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner.
We distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF).
A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process.
arXiv Detail & Related papers (2023-05-23T14:16:49Z)
- Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models [18.315856283440386]
Foundation models have achieved remarkable results in 2D and language tasks like image segmentation, object detection, and visual-language understanding.
Their potential to enrich 3D scene representation learning remains largely untapped due to the domain gap.
We propose a methodology called Bridge3D to address this gap by pre-training 3D models using features, semantic masks, and captions sourced from foundation models.
arXiv Detail & Related papers (2023-05-15T16:36:56Z)