P3P: Pseudo-3D Pre-training for Scaling 3D Masked Autoencoders
- URL: http://arxiv.org/abs/2408.10007v2
- Date: Wed, 12 Mar 2025 14:13:37 GMT
- Title: P3P: Pseudo-3D Pre-training for Scaling 3D Masked Autoencoders
- Authors: Xuechao Chen, Ying Chen, Jialin Li, Qiang Nie, Hanqiu Deng, Yong Liu, Qixing Huang, Yang Li
- Abstract summary: Pre-training in 3D is pivotal for advancing 3D perception tasks. However, the scarcity of clean 3D data poses significant challenges for scaling 3D pre-training efforts. We introduce an innovative self-supervised pre-training framework. Our method achieves state-of-the-art performance in 3D classification, detection, and few-shot learning.
- Score: 34.64343313442465
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Pre-training in 3D is pivotal for advancing 3D perception tasks. However, the scarcity of clean 3D data poses significant challenges for scaling 3D pre-training efforts. Drawing inspiration from semi-supervised learning, which effectively combines limited labeled data with abundant unlabeled data, we introduce an innovative self-supervised pre-training framework. This framework leverages both authentic 3D data and pseudo-3D data generated from images using a robust depth estimation model. Another critical challenge is the efficiency of the pre-training process. Existing approaches, such as Point-BERT and Point-MAE, utilize k-nearest neighbors for 3D token embedding, resulting in quadratic time complexity. To address this, we propose a novel token embedding strategy with linear time complexity, coupled with a training-efficient 2D reconstruction target. Our method not only achieves state-of-the-art performance in 3D classification, detection, and few-shot learning but also ensures high efficiency in both pre-training and downstream fine-tuning processes.
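The abstract contrasts k-nearest-neighbor token embedding, which requires O(N^2) pairwise distances, with the paper's linear-time embedding strategy. The sketch below is illustrative only and is not the authors' code: it lifts a depth map to pseudo-3D points and then groups them with a simple O(N) voxel-grid hash, an assumed stand-in for P3P's actual token embedding, next to a naive O(N^2) k-NN baseline for comparison.

```python
# Illustrative sketch (assumed names and grouping scheme, not the P3P code):
# pseudo-3D lifting plus linear-time grouping vs. quadratic k-NN grouping.
import numpy as np

def knn_group(points, k=4):
    """O(N^2): full pairwise distance matrix, then k nearest per point
    (the Point-BERT / Point-MAE style of token grouping)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

def grid_group(points, cell=0.5):
    """O(N): hash each point to a voxel cell in a single pass;
    points sharing a cell form one token group."""
    keys = np.floor(points / cell).astype(np.int64)
    groups = {}
    for i, key in enumerate(map(tuple, keys)):
        groups.setdefault(key, []).append(i)
    return groups

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 2.0, size=(256, 3))  # stand-in pseudo-3D points
nbrs = knn_group(pts)        # (256, 4) neighbor indices, quadratic cost
tokens = grid_group(pts)     # dict: voxel cell -> point indices, linear cost
```

Each point's nearest neighbor under `knn_group` is itself (distance zero), while `grid_group` partitions all points into disjoint cells, so every point lands in exactly one token group.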
Related papers
- DINeMo: Learning Neural Mesh Models with no 3D Annotations [7.21992608540601]
Category-level 3D/6D pose estimation is a crucial step towards comprehensive 3D scene understanding.
Recent works explored neural mesh models that approach a range of 2D and 3D tasks from an analysis-by-synthesis perspective.
We present DINeMo, a novel neural mesh model that is trained with no 3D annotations by leveraging pseudo-correspondence.
arXiv Detail & Related papers (2025-03-26T04:23:53Z)
- Bayesian Self-Training for Semi-Supervised 3D Segmentation [59.544558398992386]
3D segmentation is a core problem in computer vision.
Densely labeling 3D point clouds for fully-supervised training remains too labor-intensive and expensive.
Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set.
arXiv Detail & Related papers (2024-09-12T14:54:31Z)
- ALPI: Auto-Labeller with Proxy Injection for 3D Object Detection using 2D Labels Only [5.699475977818167]
3D object detection plays a crucial role in various applications such as autonomous vehicles, robotics and augmented reality.
We propose a weakly supervised 3D annotator that relies solely on 2D bounding box annotations from images, along with size priors.
arXiv Detail & Related papers (2024-07-24T11:58:31Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Learning Occupancy for Monocular 3D Object Detection [25.56336546513198]
We propose OccupancyM3D, a method of learning occupancy for monocular 3D detection.
It directly learns occupancy in frustum and 3D space, leading to more discriminative and informative 3D features and representations.
Experiments on KITTI and open datasets demonstrate that the proposed method achieves a new state of the art and surpasses other methods by a significant margin.
arXiv Detail & Related papers (2023-05-25T04:03:46Z) - Video Pretraining Advances 3D Deep Learning on Chest CT Tasks [63.879848037679224]
Pretraining on large natural image classification datasets has aided model development on data-scarce 2D medical tasks.
These 2D models have been surpassed by 3D models on 3D computer vision benchmarks.
We show video pretraining for 3D models can enable higher performance on smaller datasets for 3D medical tasks.
arXiv Detail & Related papers (2023-04-02T14:46:58Z)
- Few-shot Class-incremental Learning for 3D Point Cloud Objects [11.267975876074706]
Few-shot class-incremental learning (FSCIL) aims to incrementally fine-tune a model trained on base classes for a novel set of classes.
Recent efforts of FSCIL address this problem primarily on 2D image data.
Due to the advancement of camera technology, 3D point cloud data has become more available than ever.
arXiv Detail & Related papers (2022-05-30T16:33:53Z)
- Data Efficient 3D Learner via Knowledge Transferred from 2D Model [30.077342050473515]
We deal with the data scarcity challenge of 3D tasks by transferring knowledge from strong 2D models via RGB-D images.
We utilize a strong, well-trained 2D semantic segmentation model to augment RGB-D images with pseudo-labels.
Our method already outperforms existing state-of-the-art methods tailored for 3D label efficiency.
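The pseudo-labeling idea summarized above can be sketched minimally as follows. This is an assumed illustration, not the paper's code: the camera intrinsics, function names, and toy inputs are all hypothetical. A depth map is back-projected into camera-space 3D points, and each point inherits the 2D segmentation pseudo-label of its source pixel.

```python
# Hypothetical sketch of 2D->3D pseudo-label transfer (assumed intrinsics
# and names; not the paper's implementation).
import numpy as np

def backproject(depth, labels, fx, fy, cx, cy):
    """Lift an HxW depth map to (N, 3) camera-space points using the
    pinhole model, carrying each pixel's 2D pseudo-label to its 3D point."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    valid = pts[:, 2] > 0                 # drop pixels with no depth
    return pts[valid], labels.reshape(-1)[valid]

depth = np.full((4, 4), 2.0)
depth[0, 0] = 0.0                         # one pixel without depth
labels = np.arange(16).reshape(4, 4)      # stand-in 2D pseudo-labels
pts, lab = backproject(depth, labels, fx=10, fy=10, cx=2, cy=2)
```

With a real RGB-D frame, `labels` would come from a pre-trained 2D segmentation model, and the resulting labeled point cloud can supervise a 3D network without any 3D annotations.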
arXiv Detail & Related papers (2022-03-16T09:14:44Z)
- Advancing 3D Medical Image Analysis with Variable Dimension Transform based Supervised 3D Pre-training [45.90045513731704]
This paper revisits an innovative yet simple fully-supervised 3D network pre-training framework.
With a redesigned 3D network architecture, reformulated natural images are used to address the problem of data scarcity.
Comprehensive experiments on four benchmark datasets demonstrate that the proposed pre-trained models can effectively accelerate convergence.
arXiv Detail & Related papers (2022-01-05T03:11:21Z)
- RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of the synthetic dataset, which consists of CAD object models, to boost the learning on real datasets.
Recent work on 3D pre-training exhibits failure when transferring features learned on synthetic objects to other real-world applications.
In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z)
- Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation [107.07047303858664]
Large-scale human datasets with 3D ground-truth annotations are difficult to obtain in the wild.
We address this problem by augmenting existing 2D datasets with high-quality 3D pose fits.
The resulting annotations are sufficient to train 3D pose regressor networks from scratch that outperform the current state-of-the-art on in-the-wild benchmarks.
arXiv Detail & Related papers (2020-04-07T20:21:18Z)
- D3Feat: Joint Learning of Dense Detection and Description of 3D Local Features [51.04841465193678]
We leverage a 3D fully convolutional network for 3D point clouds.
We propose a novel and practical learning mechanism that densely predicts both a detection score and a description feature for each 3D point.
Our method achieves state-of-the-art results in both indoor and outdoor scenarios.
arXiv Detail & Related papers (2020-03-06T12:51:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.