PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D
Object Detection
- URL: http://arxiv.org/abs/2303.08129v1
- Date: Tue, 14 Mar 2023 17:58:03 GMT
- Title: PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D
Object Detection
- Authors: Anthony Chen, Kevin Zhang, Renrui Zhang, Zihan Wang, Yuheng Lu,
Yandong Guo, Shanghang Zhang
- Abstract summary: Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities.
In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world.
We propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects.
- Score: 26.03582038710992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked Autoencoders learn strong visual representations and achieve
state-of-the-art results in several independent modalities, yet very few works
have addressed their capabilities in multi-modality settings. In this work, we
focus on point cloud and RGB image data, two modalities that are often
presented together in the real world, and explore their meaningful
interactions. To improve upon the cross-modal synergy in existing works, we
propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D
interaction through three aspects. Specifically, we first notice the importance
of masking strategies between the two sources and utilize a projection module
to complementarily align the mask and visible tokens of the two modalities.
Then, we utilize a well-crafted two-branch MAE pipeline with a novel shared
decoder to promote cross-modality interaction in the mask tokens. Finally, we
design a unique cross-modal reconstruction module to enhance representation
learning for both modalities. Through extensive experiments performed on
large-scale RGB-D scene understanding benchmarks (SUN RGB-D and ScannetV2), we
discover it is nontrivial to interactively learn point-image features, where we
greatly improve multiple 3D detectors, 2D detectors, and few-shot classifiers
by 2.9%, 6.7%, and 2.4%, respectively. Code is available at
https://github.com/BLVLab/PiMAE.
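The complementary masking described in the abstract (the first of the three aspects) can be illustrated with a short sketch. This is not the official PiMAE implementation (that lives in the repository above); the function name `complementary_image_mask`, the patch size, and the masking ratio are illustrative assumptions, and point-token centers are assumed to be already projected to pixel coordinates by a projection module. Under one plausible reading of the complementary strategy, the image branch preferentially masks the patches covered by the point tokens that the 3D branch keeps visible, so the two visible sets complement each other.
```python
# Minimal sketch (assumptions noted above, not the official PiMAE code) of
# complementary cross-modal masking: image patches hit by *visible* point
# tokens are masked first, then the remaining budget is filled randomly.
import torch

def complementary_image_mask(point_centers_uv, visible_point_idx,
                             img_hw=(224, 224), patch=16, mask_ratio=0.75):
    """point_centers_uv: (N, 2) pixel coordinates of projected point-token centers
       visible_point_idx: (K,) indices of point tokens kept visible by the 3D branch
       returns: boolean mask over image patches (True = patch is masked)"""
    H, W = img_hw
    grid_w = W // patch
    num_patches = (H // patch) * grid_w

    # Which image patch each visible point token falls into after projection.
    uv = point_centers_uv[visible_point_idx].long()
    cols = uv[:, 0].clamp(0, W - 1) // patch
    rows = uv[:, 1].clamp(0, H - 1) // patch
    hit = torch.unique(rows * grid_w + cols)

    num_mask = int(num_patches * mask_ratio)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    # Complement the point branch: mask the patches its visible tokens cover...
    mask[hit[:num_mask]] = True
    # ...then fill any remaining masking budget with randomly chosen patches.
    remaining = num_mask - int(mask.sum())
    if remaining > 0:
        free = torch.nonzero(~mask).squeeze(1)
        mask[free[torch.randperm(free.numel())[:remaining]]] = True
    return mask
```
With such a mask, the image encoder mostly sees regions the point branch has hidden, which is one way to realize the cross-modal alignment of mask and visible tokens that the abstract describes; the shared decoder and cross-modal reconstruction modules then operate on the resulting token sets.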
Related papers
- Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval [5.965791109321719]
Cross-modal hashing between 2D images and 3D point-cloud data is a growing concern in real-world retrieval systems.
We propose contrastive masked autoencoders based self-supervised hashing (CMAH) for retrieval between images and point-cloud data.
arXiv Detail & Related papers (2024-08-11T07:03:21Z)
- M$^{3}$3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding [5.989397492717352]
We present M$^{3}$3D (Multi-Modal Masked 3D), built on multi-modal masked autoencoders.
We integrate two major self-supervised learning frameworks: Masked Image Modeling (MIM) and contrastive learning.
Experiments show that M$3$3D outperforms the existing state-of-the-art approaches on ScanNet, NYUv2, UCF-101 and OR-AR.
arXiv Detail & Related papers (2023-09-26T23:52:09Z)
- UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving [47.590099762244535]
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks.
This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving.
To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, we propose UniM$2$AE.
arXiv Detail & Related papers (2023-08-21T02:13:40Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- Unleash the Potential of Image Branch for Cross-modal 3D Object Detection [67.94357336206136]
We present a new cross-modal 3D object detector, namely UPIDet, which aims to unleash the potential of the image branch from two aspects.
First, UPIDet introduces a new 2D auxiliary task called normalized local coordinate map estimation.
Second, we discover that the representational capability of the point cloud backbone can be enhanced through the gradients backpropagated from the training objectives of the image branch.
arXiv Detail & Related papers (2023-01-22T08:26:58Z)
- PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student.
By pair-wise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without exhaustive and complicated network modifications (a minimal sketch of this pair-wise alignment appears after this list).
arXiv Detail & Related papers (2022-07-07T07:23:20Z)
- DetMatch: Two Teachers are Better Than One for Joint 2D and 3D Semi-Supervised Object Detection [29.722784254501768]
DetMatch is a flexible framework for joint semi-supervised learning on 2D and 3D modalities.
By identifying objects detected in both sensors, our pipeline generates a cleaner, more robust set of pseudo-labels.
We leverage the richer semantics of RGB images to rectify incorrect 3D class predictions and improve localization of 3D boxes.
arXiv Detail & Related papers (2022-03-17T17:58:00Z)
- Learning Joint 2D-3D Representations for Depth Completion [90.62843376586216]
We design a simple yet effective neural network block that learns to extract joint 2D and 3D features.
Specifically, the block consists of two domain-specific sub-networks that apply 2D convolution on image pixels and continuous convolution on 3D points.
arXiv Detail & Related papers (2020-12-22T22:58:29Z)
- Cross-Modality 3D Object Detection [63.29935886648709]
We present a novel two-stage multi-modal fusion network for 3D object detection.
The whole architecture facilitates two-stage fusion.
Our experiments on the KITTI dataset show that the proposed multi-stage fusion helps the network to learn better representations.
arXiv Detail & Related papers (2020-08-16T11:01:20Z)
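The pair-wise multi-view alignment referenced in the PointMCD entry above can also be sketched briefly. This is a hedged, minimal illustration and not the authors' code: the per-view projection heads, the feature dimensions, and the cosine-distance loss are assumptions made for the sketch.
```python
# Sketch of pair-wise multi-view cross-modal distillation (assumptions above):
# a frozen image encoder (teacher) embeds V rendered views of a shape, and a
# point encoder (student) is trained so that per-view projections of its
# global feature match the teacher's view descriptors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewAlignLoss(nn.Module):
    def __init__(self, point_dim=1024, img_dim=768, num_views=12):
        super().__init__()
        # One lightweight projection head per view maps the student's global
        # point feature into the teacher's embedding space (an assumption of
        # this sketch, not necessarily the paper's exact head design).
        self.heads = nn.ModuleList(
            [nn.Linear(point_dim, img_dim) for _ in range(num_views)]
        )

    def forward(self, point_feat, view_feats):
        """point_feat: (B, point_dim) student feature of the whole shape
           view_feats: (B, V, img_dim) frozen teacher features, one per view"""
        loss = 0.0
        for v, head in enumerate(self.heads):
            student_v = F.normalize(head(point_feat), dim=-1)
            teacher_v = F.normalize(view_feats[:, v], dim=-1)
            # Pair-wise alignment: pull each student view descriptor toward
            # the matching teacher view descriptor (cosine distance).
            loss = loss + (1.0 - (student_v * teacher_v).sum(dim=-1)).mean()
        return loss / len(self.heads)
```
Keeping the teacher frozen means the loss only updates the point encoder and the projection heads; replacing the cosine distance with an MSE between normalized descriptors would be an equally plausible choice in this sketch.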