PonderV2: Pave the Way for 3D Foundation Model with A Universal
Pre-training Paradigm
- URL: http://arxiv.org/abs/2310.08586v3
- Date: Tue, 27 Feb 2024 13:53:43 GMT
- Title: PonderV2: Pave the Way for 3D Foundation Model with A Universal
Pre-training Paradigm
- Authors: Haoyi Zhu, Honghui Yang, Xiaoyang Wu, Di Huang, Sha Zhang, Xianglong
He, Hengshuang Zhao, Chunhua Shen, Yu Qiao, Tong He, Wanli Ouyang
- Abstract summary: We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
- Score: 114.47216525866435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In contrast to numerous NLP and 2D vision foundational models, learning a 3D
foundational model poses considerably greater challenges. This is primarily due
to the inherent data variability and diversity of downstream tasks. In this
paper, we introduce a novel universal 3D pre-training framework designed to
facilitate the acquisition of efficient 3D representation, thereby establishing
a pathway to 3D foundational models. Considering that informative 3D features
should encode rich geometry and appearance cues that can be utilized to render
realistic images, we propose to learn 3D representations by differentiable
neural rendering. We train a 3D backbone with a purpose-built volumetric neural
renderer, supervising it by comparing rendered images against real ones. Notably, our approach
seamlessly integrates the learned 3D encoder into various downstream tasks.
These tasks encompass not only high-level challenges such as 3D detection and
segmentation but also low-level objectives like 3D reconstruction and image
synthesis, spanning both indoor and outdoor scenarios. Besides, we also
illustrate the capability of pre-training a 2D backbone using the proposed
methodology, surpassing conventional pre-training methods by a large margin.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor
and outdoor benchmarks, demonstrating its effectiveness. Code and models are
available at https://github.com/OpenGVLab/PonderV2.
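The pre-training recipe in the abstract can be sketched in a few lines: decode a 3D feature volume into per-point density and color, volume-render camera rays by alpha compositing, and supervise with a photometric loss against real images. The toy decoder, grid shapes, and variable names below are hypothetical stand-ins, not PonderV2's actual renderer or backbone; this is only a minimal illustration of the idea under those assumptions.

```python
# Minimal sketch of pre-training by differentiable volumetric rendering:
# features -> (density, color) -> alpha-composited RGB -> MSE vs. real pixels.
# All shapes and the linear "decoder" here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def render_rays(volume, ray_samples):
    """Alpha-composite density/color along each ray.

    volume      : (D, H, W, C) feature grid (stand-in for backbone output)
    ray_samples : (R, S, 3) integer voxel indices of S samples along R rays
    returns     : (R, 3) rendered RGB per ray
    """
    feats = volume[ray_samples[..., 0],
                   ray_samples[..., 1],
                   ray_samples[..., 2]]                    # (R, S, C)
    # Toy decoder: channel 0 -> density, channels 1..3 -> color.
    sigma = np.maximum(feats[..., 0], 0.0)                 # (R, S)
    color = 1.0 / (1.0 + np.exp(-feats[..., 1:4]))         # (R, S, 3)
    alpha = 1.0 - np.exp(-sigma)                           # (R, S)
    # Transmittance: product of (1 - alpha) over all *previous* samples.
    trans = np.cumprod(1.0 - alpha + 1e-10, axis=1)
    trans = np.concatenate([np.ones_like(trans[:, :1]), trans[:, :-1]], axis=1)
    weights = alpha * trans                                # (R, S)
    return (weights[..., None] * color).sum(axis=1)        # (R, 3)

# Fake data standing in for backbone features, sampled rays, and real pixels.
volume = rng.normal(size=(8, 8, 8, 4))
rays = rng.integers(0, 8, size=(16, 12, 3))     # 16 rays, 12 samples each
target_rgb = rng.uniform(size=(16, 3))          # pixels from a "real" image

rendered = render_rays(volume, rays)
loss = float(np.mean((rendered - target_rgb) ** 2))  # photometric MSE
print(rendered.shape, loss)
```

In the actual framework, the gradient of this photometric loss would flow back through the renderer into the 3D encoder, which is what makes the rendering objective usable as a pre-training signal.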
Related papers
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving.
It predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
It is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is trained directly on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- Probing the 3D Awareness of Visual Foundation Models [56.68380136809413]
We analyze the 3D awareness of visual foundation models.
We conduct experiments using task-specific probes and zero-shot inference procedures on frozen features.
arXiv Detail & Related papers (2024-04-12T17:58:04Z)
- Uni3D: Exploring Unified 3D Representation at Scale [66.26710717073372]
We present Uni3D, a 3D foundation model to explore the unified 3D representation at scale.
Uni3D uses a 2D ViT end-to-end pretrained to align the 3D point cloud features with the image-text aligned features.
We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
arXiv Detail & Related papers (2023-10-10T16:49:21Z)
- DRaCoN -- Differentiable Rasterization Conditioned Neural Radiance Fields for Articulated Avatars [92.37436369781692]
We present DRaCoN, a framework for learning full-body volumetric avatars.
It exploits the advantages of both the 2D and 3D neural rendering techniques.
Experiments on the challenging ZJU-MoCap and Human3.6M datasets indicate that DRaCoN outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-03-29T17:59:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.