Related papers: P3P: Pseudo-3D Pre-training for Scaling 3D Masked Autoencoders

P3P: Pseudo-3D Pre-training for Scaling 3D Masked Autoencoders

URL: http://arxiv.org/abs/2408.10007v1
Date: Mon, 19 Aug 2024 13:59:53 GMT
Title: P3P: Pseudo-3D Pre-training for Scaling 3D Masked Autoencoders
Authors: Xuechao Chen, Ying Chen, Jialin Li, Qiang Nie, Yong Liu, Qixing Huang, Yang Li,
Abstract summary: We propose a novel self-supervised pre-training framework utilizing the real 3D data and the pseudo-3D data lifted from images by a large depth estimation model. Our method achieves state-of-the-art performance in 3D classification and few-shot learning while maintaining high pre-training and downstream fine-tuning efficiency.
Score: 32.85484320025852
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: 3D pre-training is crucial to 3D perception tasks. However, limited by the difficulties in collecting clean 3D data, 3D pre-training consistently faced data scaling challenges. Inspired by semi-supervised learning leveraging limited labeled data and a large amount of unlabeled data, in this work, we propose a novel self-supervised pre-training framework utilizing the real 3D data and the pseudo-3D data lifted from images by a large depth estimation model. Another challenge lies in the efficiency. Previous methods such as Point-BERT and Point-MAE, employ k nearest neighbors to embed 3D tokens, requiring quadratic time complexity. To efficiently pre-train on such a large amount of data, we propose a linear-time-complexity token embedding strategy and a training-efficient 2D reconstruction target. Our method achieves state-of-the-art performance in 3D classification and few-shot learning while maintaining high pre-training and downstream fine-tuning efficiency.

Related papers

Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D [68.23391872643268]
LOCATE 3D is a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp"<n>It operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices.
arXiv Detail & Related papers (2025-04-19T02:51:24Z)
DINeMo: Learning Neural Mesh Models with no 3D Annotations [7.21992608540601]
Category-level 3D/6D pose estimation is a crucial step towards comprehensive 3D scene understanding. Recent works explored neural mesh models that approach a range of 2D and 3D tasks from an analysis-by-synthesis perspective. We present DINeMo, a novel neural mesh model that is trained with no 3D annotations by leveraging pseudo-correspondence.
arXiv Detail & Related papers (2025-03-26T04:23:53Z)
TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction [137.34863114016483]
TAR3D is a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT)<n>We show that TAR3D can achieve superior generation quality over existing methods in text-to-3D and image-to-3D tasks.
arXiv Detail & Related papers (2024-12-22T08:28:20Z)
Bayesian Self-Training for Semi-Supervised 3D Segmentation [59.544558398992386]
3D segmentation is a core problem in computer vision. densely labeling 3D point clouds to employ fully-supervised training remains too labor intensive and expensive. Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set.
arXiv Detail & Related papers (2024-09-12T14:54:31Z)
ALPI: Auto-Labeller with Proxy Injection for 3D Object Detection using 2D Labels Only [5.699475977818167]
3D object detection plays a crucial role in various applications such as autonomous vehicles, robotics and augmented reality. We propose a weakly supervised 3D annotator that relies solely on 2D bounding box annotations from images, along with size priors.
arXiv Detail & Related papers (2024-07-24T11:58:31Z)
DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts. Our model is directly trained on extensive noisy and unaligned in-the-wild' 3D assets. We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
Learning Occupancy for Monocular 3D Object Detection [25.56336546513198]
We propose textbfOccupancyM3D, a method of learning occupancy for monocular 3D detection. It directly learns occupancy in frustum and 3D space, leading to more discriminative and informative 3D features and representations. Experiments on KITTI and open datasets demonstrate that the proposed method achieves a new state of the art and surpasses other methods by a significant margin.
arXiv Detail & Related papers (2023-05-25T04:03:46Z)
Video Pretraining Advances 3D Deep Learning on Chest CT Tasks [63.879848037679224]
Pretraining on large natural image classification datasets has aided model development on data-scarce 2D medical tasks. These 2D models have been surpassed by 3D models on 3D computer vision benchmarks. We show video pretraining for 3D models can enable higher performance on smaller datasets for 3D medical tasks.
arXiv Detail & Related papers (2023-04-02T14:46:58Z)
Few-shot Class-incremental Learning for 3D Point Cloud Objects [11.267975876074706]
Few-shot class-incremental learning (FSCIL) aims to incrementally fine-tune a model trained on base classes for a novel set of classes. Recent efforts of FSCIL address this problem primarily on 2D image data. Due to the advancement of camera technology, 3D point cloud data has become more available than ever.
arXiv Detail & Related papers (2022-05-30T16:33:53Z)
Semi-supervised 3D shape segmentation with multilevel consistency and part substitution [21.075426681857024]
We propose an effective semi-supervised method for learning 3D segmentations from a few labeled 3D shapes and a large amount of unlabeled 3D data. For the unlabeled data, we present a novel multilevel consistency loss to enforce consistency of network predictions between perturbed copies of a 3D shape. For the labeled data, we develop a simple yet effective part substitution scheme to augment the labeled 3D shapes with more structural variations to enhance training.
arXiv Detail & Related papers (2022-04-19T11:48:24Z)
Data Efficient 3D Learner via Knowledge Transferred from 2D Model [30.077342050473515]
We deal with the data scarcity challenge of 3D tasks by transferring knowledge from strong 2D models via RGB-D images. We utilize a strong and well-trained semantic segmentation model for 2D images to augment RGB-D images with pseudo-label. Our method already outperforms existing state-of-the-art that is tailored for 3D label efficiency.
arXiv Detail & Related papers (2022-03-16T09:14:44Z)
Advancing 3D Medical Image Analysis with Variable Dimension Transform based Supervised 3D Pre-training [45.90045513731704]
This paper revisits an innovative yet simple fully-supervised 3D network pre-training framework. With a redesigned 3D network architecture, reformulated natural images are used to address the problem of data scarcity. Comprehensive experiments on four benchmark datasets demonstrate that the proposed pre-trained models can effectively accelerate convergence.
arXiv Detail & Related papers (2022-01-05T03:11:21Z)
RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of the synthetic dataset, which consists of CAD object models, to boost the learning on real datasets. Recent work on 3D pre-training exhibits failure when transfer features learned on synthetic objects to other real-world applications. In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z)
Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation [107.07047303858664]
Large-scale human datasets with 3D ground-truth annotations are difficult to obtain in the wild. We address this problem by augmenting existing 2D datasets with high-quality 3D pose fits. The resulting annotations are sufficient to train from scratch 3D pose regressor networks that outperform the current state-of-the-art on in-the-wild benchmarks.
arXiv Detail & Related papers (2020-04-07T20:21:18Z)
D3Feat: Joint Learning of Dense Detection and Description of 3D Local Features [51.04841465193678]
We leverage a 3D fully convolutional network for 3D point clouds. We propose a novel and practical learning mechanism that densely predicts both a detection score and a description feature for each 3D point. Our method achieves state-of-the-art results in both indoor and outdoor scenarios.
arXiv Detail & Related papers (2020-03-06T12:51:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.