Point Cloud Self-supervised Learning via 3D to Multi-view Masked
Autoencoder
- URL: http://arxiv.org/abs/2311.10887v1
- Date: Fri, 17 Nov 2023 22:10:03 GMT
- Title: Point Cloud Self-supervised Learning via 3D to Multi-view Masked
Autoencoder
- Authors: Zhimin Chen, Yingwei Li, Longlong Jing, Liang Yang, Bing Li
- Abstract summary: Multi-Modality Masked AutoEncoder (MAE) methods leverage both 2D images and 3D point clouds for pre-training.
We introduce a novel approach employing a 3D to multi-view masked autoencoder to fully harness the multi-modal attributes of 3D point clouds.
Our method outperforms state-of-the-art counterparts by a large margin in a variety of downstream tasks.
- Score: 21.73287941143304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the field of 3D self-supervised learning has witnessed
significant progress, resulting in the emergence of Multi-Modality Masked
AutoEncoder (MAE) methods that leverage both 2D images and 3D point clouds for
pre-training. However, a notable limitation of these approaches is that they do
not fully utilize the multi-view attributes inherent in 3D point clouds, which
are crucial for a deeper understanding of 3D structures. Building upon this
insight, we introduce a novel approach employing a 3D to multi-view masked
autoencoder to fully harness the multi-modal attributes of 3D point clouds. To
be specific, our method uses the encoded tokens from 3D masked point clouds to
generate original point clouds and multi-view depth images across various
poses. This approach not only enriches the model's comprehension of geometric
structures but also leverages the inherent multi-modal properties of point
clouds. Our experiments demonstrate the effectiveness of the proposed method
across different tasks and settings. Remarkably, our method
outperforms state-of-the-art counterparts by a large margin in a variety of
downstream tasks, including 3D object classification, few-shot learning, part
segmentation, and 3D object detection. Code will be available at:
https://github.com/Zhimin-C/Multiview-MAE
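As a rough illustration of the objective described above, the sketch below encodes the visible tokens of a masked point cloud and decodes both patch-wise coordinates and per-view depth maps. All module names, sizes, and the view-conditioning scheme are assumptions for illustration, not the authors' released implementation:

```python
# Illustrative sketch of a 3D-to-multi-view masked autoencoder; all names,
# sizes, and the view conditioning are assumptions, not the released code.
import torch
import torch.nn as nn

class MultiViewMAESketch(nn.Module):
    def __init__(self, dim=384, n_views=6, pts_per_token=32, depth_hw=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.point_head = nn.Linear(dim, 3 * pts_per_token)    # xyz reconstruction
        self.view_queries = nn.Parameter(torch.randn(n_views, dim))
        self.depth_head = nn.Linear(dim, depth_hw * depth_hw)  # one depth map per view

    def forward(self, visible_tokens):
        # visible_tokens: (B, N_vis, dim) embeddings of the unmasked point patches
        latent = self.encoder(visible_tokens)
        pred_points = self.point_head(latent)                  # patch-wise xyz predictions
        cond = self.view_queries + latent.mean(dim=1, keepdim=True)  # (B, n_views, dim)
        pred_depth = self.depth_head(cond)                     # flattened per-view depth maps
        return pred_points, pred_depth
```

In a full pipeline, the point branch would typically be supervised with a Chamfer distance against the input cloud, and the depth branch with a pixel-wise loss against depth maps rendered from the corresponding poses.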
Related papers
- Leveraging Large-Scale Pretrained Vision Foundation Models for
Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
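A minimal sketch of such voting-based fusion, assuming each model's 2D predictions have already been projected onto the 3D points (function and argument names are illustrative):

```python
# Hedged sketch of voting-style label fusion: per-point predictions from
# several 2D models/views are merged by majority vote.
import numpy as np

def fuse_labels_by_voting(per_view_labels, n_classes, ignore=-1):
    """per_view_labels: (V, N) int array with each model/view's label for each
    of N points; `ignore` marks points not visible in that view."""
    V, N = per_view_labels.shape
    votes = np.zeros((N, n_classes), dtype=np.int64)
    for v in range(V):
        valid = per_view_labels[v] != ignore
        votes[np.flatnonzero(valid), per_view_labels[v, valid]] += 1
    fused = votes.argmax(axis=1)
    fused[votes.sum(axis=1) == 0] = ignore  # no view observed this point
    return fused
```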
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
- Learning 3D Representations from 2D Pre-trained Models via
Image-to-Point Masked Autoencoders [52.91248611338202]
We propose an alternative way to obtain superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE.
Through self-supervised pre-training, we leverage the well-learned 2D knowledge to guide 3D masked autoencoding.
I2P-MAE attains state-of-the-art 90.11% accuracy, +3.68% over the second-best, demonstrating superior transferability.
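One plausible reading of 2D knowledge guiding 3D masked autoencoding is saliency-guided masking: tokens whose projected 2D saliency is high are kept visible more often. The sketch below illustrates that idea under the assumption that a frozen 2D model supplies per-token saliency scores; it is not I2P-MAE's exact procedure:

```python
# Illustrative saliency-guided masking; the sampling scheme is an assumption.
import torch

def saliency_guided_mask(saliency, mask_ratio=0.8):
    """saliency: (B, N) per-token scores projected from 2D feature maps.
    Returns a bool mask, True = masked (hidden from the 3D encoder)."""
    B, N = saliency.shape
    n_keep = int(N * (1.0 - mask_ratio))
    # sample visible tokens without replacement, biased toward salient ones
    keep = torch.multinomial(saliency.softmax(dim=-1), n_keep, replacement=False)
    mask = torch.ones(B, N, dtype=torch.bool)
    mask.scatter_(1, keep, False)
    return mask
```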
arXiv Detail & Related papers (2022-12-13T17:59:20Z)
- PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal
Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student.
By pairwise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without exhaustive and complicated network modifications.
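A hedged sketch of the pairwise alignment objective, assuming one descriptor per rendered view on both sides and a cosine-distance loss (the actual PointMCD loss may differ):

```python
# Sketch of per-view visual-geometric alignment for distillation; the loss
# choice here is an assumption.
import torch
import torch.nn.functional as F

def multiview_distill_loss(student_desc, teacher_desc):
    """student_desc, teacher_desc: (B, V, D), one descriptor per rendered view."""
    s = F.normalize(student_desc, dim=-1)
    t = F.normalize(teacher_desc.detach(), dim=-1)  # teacher stays frozen
    return (1.0 - (s * t).sum(dim=-1)).mean()       # mean cosine distance
```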
arXiv Detail & Related papers (2022-07-07T07:23:20Z)
- Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud
Pre-training [56.81809311892475]
Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers.
We propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds.
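One simple ingredient of hierarchical masked pre-training is keeping the masked regions spatially consistent across scales. The sketch below propagates a coarse-scale mask to a finer scale by nearest neighbor; this is an illustrative assumption, not necessarily Point-M2AE's exact strategy:

```python
# Illustrative multi-scale mask propagation for hierarchical point MAEs.
import torch

def propagate_mask(coarse_xyz, coarse_mask, fine_xyz):
    """Give each fine-scale point the mask of its nearest coarse-scale point.
    coarse_xyz: (M, 3), coarse_mask: (M,) bool, fine_xyz: (N, 3)."""
    dist = torch.cdist(fine_xyz, coarse_xyz)  # (N, M) pairwise distances
    nearest = dist.argmin(dim=1)              # nearest coarse point per fine point
    return coarse_mask[nearest]               # (N,) bool mask at the finer scale
```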
arXiv Detail & Related papers (2022-05-28T11:22:53Z)
- Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic
Segmentation [3.5939555573102853]
Recent works on 3D semantic segmentation propose to exploit the synergy between images and point clouds by processing each modality with a dedicated network.
We propose an end-to-end trainable multi-view aggregation model leveraging the viewing conditions of 3D points to merge features from images taken at arbitrary positions.
Our method can combine standard 2D and 3D networks and outperforms both 3D models operating on colorized point clouds and hybrid 2D/3D networks.
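The core mechanism can be sketched as a learned softmax over views driven by viewing conditions; the condition features and network sizes below are assumptions for illustration:

```python
# Illustrative viewing-condition-based aggregation of per-point image features.
import torch
import torch.nn as nn

class ViewAggregatorSketch(nn.Module):
    def __init__(self, cond_dim=4):
        super().__init__()
        # tiny MLP scoring each observation from its viewing conditions
        self.score = nn.Sequential(nn.Linear(cond_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, feats, conds, visible):
        # feats: (N, V, C) image features lifted to each point from V views
        # conds: (N, V, cond_dim) viewing conditions, e.g. distance and angle
        # visible: (N, V) bool, whether the point is seen in each view
        logits = self.score(conds).squeeze(-1).masked_fill(~visible, float('-inf'))
        w = torch.softmax(logits, dim=1).nan_to_num(0.0)  # unseen points get zero weight
        return (w.unsqueeze(-1) * feats).sum(dim=1)       # (N, C) fused feature
```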
arXiv Detail & Related papers (2022-04-15T17:10:48Z)
- Sparse Fuse Dense: Towards High Quality 3D Detection with Depth
Completion [31.52721107477401]
Current LiDAR-only 3D detection methods inevitably suffer from the sparsity of point clouds.
We present a novel multi-modal framework SFD (Sparse Fuse Dense), which utilizes pseudo point clouds generated from depth completion.
Our method holds the top entry on the KITTI car 3D object detection leaderboard, demonstrating the effectiveness of SFD.
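The pseudo point cloud comes from back-projecting a completed depth map through the camera. A minimal sketch under the standard pinhole model (variable names assumed):

```python
# Hedged sketch of pseudo-point generation from a completed depth map.
import numpy as np

def depth_to_pseudo_points(depth, K):
    """depth: (H, W) completed depth in meters; K: (3, 3) camera intrinsics."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]  # pinhole back-projection
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)
    return pts[z > 0]  # keep pixels with valid depth
```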
arXiv Detail & Related papers (2022-03-18T07:56:35Z)
- CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D
Point Cloud Understanding [2.8661021832561757]
CrossPoint is a simple cross-modal contrastive learning approach to learn transferable 3D point cloud representations.
Our approach outperforms previous unsupervised learning methods on a diverse range of downstream tasks, including 3D object classification and segmentation.
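A minimal sketch of such a cross-modal contrastive objective, assuming matched point-cloud/image embedding pairs within a batch; the temperature and symmetric form are assumptions:

```python
# Illustrative cross-modal InfoNCE between point-cloud and image embeddings.
import torch
import torch.nn.functional as F

def cross_modal_nce(z_pts, z_img, tau=0.07):
    """z_pts, z_img: (B, D) embeddings of matched point clouds / renderings."""
    z_pts, z_img = F.normalize(z_pts, dim=1), F.normalize(z_img, dim=1)
    logits = z_pts @ z_img.t() / tau                 # (B, B) similarity matrix
    target = torch.arange(z_pts.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))
```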
arXiv Detail & Related papers (2022-03-01T18:59:01Z)
- Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding [80.04281842702294]
We introduce the concept of the multi-view point cloud (Voint cloud), representing each 3D point as a set of features extracted from several viewpoints.
This novel 3D Voint cloud representation combines the compactness of 3D point cloud representation with the natural view-awareness of multi-view representation.
We deploy a Voint neural network (VointNet) with a theoretically established functional form to learn representations in the Voint space.
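The data structure itself is easy to sketch: each point stores one feature per viewpoint, and a small shared network pools over the view axis. The module below is an illustrative stand-in, not VointNet's published form:

```python
# Illustrative Voint-cloud pooling: per-point, per-view features reduced to a
# view-invariant per-point feature.
import torch
import torch.nn as nn

class VointPoolSketch(nn.Module):
    def __init__(self, in_dim=64, out_dim=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, voints):
        # voints: (B, N, V, C) one feature per point per viewpoint
        return self.phi(voints).max(dim=2).values  # pool over views -> (B, N, out_dim)
```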
arXiv Detail & Related papers (2021-11-30T13:08:19Z)
- From Multi-View to Hollow-3D: Hallucinated Hollow-3D R-CNN for 3D Object
Detection [101.20784125067559]
We propose a new architecture, namely Hallucinated Hollow-3D R-CNN, to address the problem of 3D object detection.
In our approach, we first extract the multi-view features by sequentially projecting the point clouds into the perspective view and the bird's-eye view.
The 3D objects are detected via a box refinement module with a novel Hierarchical Voxel RoI Pooling operation.
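As an illustration of one of the two projections, the sketch below scatters LiDAR points into a bird's-eye-view occupancy grid; the grid extents and resolution are assumptions (KITTI-like values):

```python
# Illustrative bird's-eye-view projection of a LiDAR point cloud.
import numpy as np

def points_to_bev(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), res=0.1):
    """points: (N, 3) LiDAR xyz -> (H, W) BEV occupancy map."""
    H = int((y_range[1] - y_range[0]) / res)
    W = int((x_range[1] - x_range[0]) / res)
    bev = np.zeros((H, W), dtype=np.float32)
    ix = ((points[:, 0] - x_range[0]) / res).astype(int)
    iy = ((points[:, 1] - y_range[0]) / res).astype(int)
    ok = (ix >= 0) & (ix < W) & (iy >= 0) & (iy < H)  # drop out-of-range points
    bev[iy[ok], ix[ok]] = 1.0
    return bev
```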
arXiv Detail & Related papers (2021-07-30T02:00:06Z)