PonderV2: Pave the Way for 3D Foundation Model with A Universal
Pre-training Paradigm
- URL: http://arxiv.org/abs/2310.08586v3
- Date: Tue, 27 Feb 2024 13:53:43 GMT
- Authors: Haoyi Zhu, Honghui Yang, Xiaoyang Wu, Di Huang, Sha Zhang, Xianglong
He, Hengshuang Zhao, Chunhua Shen, Yu Qiao, Tong He, Wanli Ouyang
- Abstract summary: We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
- Score: 114.47216525866435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In contrast to numerous NLP and 2D vision foundational models, learning a 3D
foundational model poses considerably greater challenges. This is primarily due
to the inherent data variability and diversity of downstream tasks. In this
paper, we introduce a novel universal 3D pre-training framework designed to
facilitate the acquisition of efficient 3D representation, thereby establishing
a pathway to 3D foundational models. Considering that informative 3D features
should encode rich geometry and appearance cues that can be utilized to render
realistic images, we propose to learn 3D representations by differentiable
neural rendering. We train a 3D backbone with a devised volumetric neural
renderer by comparing the rendered images with the real ones. Notably, our approach
seamlessly integrates the learned 3D encoder into various downstream tasks.
These tasks encompass not only high-level challenges such as 3D detection and
segmentation but also low-level objectives like 3D reconstruction and image
synthesis, spanning both indoor and outdoor scenarios. We also
illustrate the capability of pre-training a 2D backbone using the proposed
methodology, surpassing conventional pre-training methods by a large margin.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor
and outdoor benchmarks, implying its effectiveness. Code and models are
available at https://github.com/OpenGVLab/PonderV2.
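As a rough illustration of the rendering-based objective described in the abstract, the sketch below composites samples along a single camera ray and compares the result with a ground-truth pixel. It uses the standard emission-absorption compositing found in NeRF-style renderers, which is an assumption made for illustration and not necessarily the renderer used in PonderV2. The names `composite_ray` and `photometric_loss` are hypothetical. In actual pre-training, the densities and colors would be predicted from 3D backbone features and the loss would be back-propagated through the renderer.

```python
import math
from typing import List, Sequence, Tuple

Color = Tuple[float, float, float]


def composite_ray(
    densities: Sequence[float],
    colors: Sequence[Color],
    deltas: Sequence[float],
) -> Tuple[Color, List[float]]:
    """Alpha-composite samples along one ray, ordered from near to far.

    Assumed (illustrative) model: alpha_i = 1 - exp(-sigma_i * delta_i),
    weight_i = T_i * alpha_i, where T_i is the transmittance accumulated
    in front of sample i.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    weights: List[float] = []
    rgb = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(3):
            rgb[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return (rgb[0], rgb[1], rgb[2]), weights


def photometric_loss(rendered: Color, target: Color) -> float:
    """Mean squared error between a rendered pixel and the real pixel."""
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / 3.0


# Hypothetical example: a nearly opaque red sample in front of a green one.
pixel, ray_weights = composite_ray(
    densities=[50.0, 50.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    deltas=[1.0, 1.0],
)
loss = photometric_loss(pixel, (1.0, 0.0, 0.0))
```

Because the front sample is almost fully opaque, the rendered pixel is dominated by its color and the loss against a red target is close to zero. Minimizing this kind of loss over many rays is what pushes the predicted geometry and appearance toward the observed images.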
Related papers
- ImOV3D: Learning Open-Vocabulary Point Clouds 3D Object Detection from Only 2D Images [19.02348585677397]
Open-vocabulary 3D object detection (OV-3Det) aims to generalize beyond the limited number of base categories labeled during the training phase.
The biggest bottleneck is the scarcity of annotated 3D data, whereas 2D image datasets are abundant and richly annotated.
We propose a novel framework ImOV3D to leverage pseudo multimodal representation containing both images and point clouds (PC) to close the modality gap.
arXiv Detail & Related papers (2024-10-31T15:02:05Z)
- ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images [47.682942867405224]
ConDense is a framework for 3D pre-training utilizing existing 2D networks and large-scale multi-view datasets.
We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline.
arXiv Detail & Related papers (2024-08-30T05:57:01Z)
- Improving 2D Feature Representations by 3D-Aware Fine-Tuning [17.01280751430423]
Current visual foundation models are trained purely on unstructured 2D data.
We show that fine-tuning on 3D-aware data improves the quality of emerging semantic features.
arXiv Detail & Related papers (2024-07-29T17:59:21Z)
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- Probing the 3D Awareness of Visual Foundation Models [56.68380136809413]
We analyze the 3D awareness of visual foundation models.
We conduct experiments using task-specific probes and zero-shot inference procedures on frozen features.
arXiv Detail & Related papers (2024-04-12T17:58:04Z)
- Uni3D: Exploring Unified 3D Representation at Scale [66.26710717073372]
We present Uni3D, a 3D foundation model to explore the unified 3D representation at scale.
Uni3D uses a 2D ViT end-to-end pretrained to align the 3D point cloud features with the image-text aligned features.
We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
arXiv Detail & Related papers (2023-10-10T16:49:21Z)
- DRaCoN -- Differentiable Rasterization Conditioned Neural Radiance Fields for Articulated Avatars [92.37436369781692]
We present DRaCoN, a framework for learning full-body volumetric avatars.
It exploits the advantages of both the 2D and 3D neural rendering techniques.
Experiments on the challenging ZJU-MoCap and Human3.6M datasets indicate that DRaCoN outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-03-29T17:59:15Z)