Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning
- URL: http://arxiv.org/abs/2407.05862v1
- Date: Mon, 8 Jul 2024 12:28:56 GMT
- Title: Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning
- Authors: Bin Ren, Guofeng Mei, Danda Pani Paudel, Weijie Wang, Yawei Li, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Nicu Sebe
- Abstract summary: Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
- Score: 116.75939193785143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones. However, in 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant. This raises the question: Can we take the best of both worlds? To answer this question, we first empirically validate that integrating MAE-based point cloud pre-training with the standard contrastive learning paradigm, even with meticulous design, can lead to a decrease in performance. To address this limitation, we reintroduce CL into the MAE-based point cloud pre-training paradigm by leveraging the inherent contrastive properties of MAE. Specifically, rather than relying on extensive data augmentation as commonly used in the image domain, we randomly mask the input tokens twice to generate contrastive input pairs. Subsequently, a weight-sharing encoder and two identically structured decoders are utilized to perform masked token reconstruction. Additionally, we propose that for an input token masked by both masks simultaneously, the reconstructed features should be as similar as possible. This naturally establishes an explicit contrastive constraint within the generative MAE-based pre-training paradigm, resulting in our proposed method, Point-CMAE. Consequently, Point-CMAE effectively enhances the representation quality and transfer performance compared to its MAE counterpart. Experimental evaluations across various downstream applications, including classification, part segmentation, and few-shot learning, demonstrate the efficacy of our framework in surpassing state-of-the-art techniques under standard ViTs and single-modal settings. The source code and trained models are available at: https://github.com/Amazingren/Point-CMAE.
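As a reading aid, here is a minimal, self-contained PyTorch sketch of the dual-masking idea the abstract describes: the same token sequence is masked twice, a weight-sharing encoder processes the visible tokens of each view, two identically structured decoders reconstruct the masked tokens, and tokens masked in both views are additionally constrained to be reconstructed similarly. The module names and sizes (TinyEncoder, TinyDecoder, point_cmae_step), the plain MSE reconstruction target, and the cosine consistency loss are illustrative assumptions rather than the released Point-CMAE implementation; see https://github.com/Amazingren/Point-CMAE for the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Stand-in for the weight-sharing ViT encoder that sees only visible tokens."""

    def __init__(self, dim=96, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):
        return self.blocks(tokens)


class TinyDecoder(nn.Module):
    """One of the two identically structured decoders reconstructing masked tokens."""

    def __init__(self, dim=96, depth=1, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, visible_feats, mask):
        # Scatter encoded visible tokens back to full length; masked slots get a
        # learnable mask token, then the full sequence is decoded.
        B, N = mask.shape
        full = self.mask_token.expand(B, N, -1).clone()
        full[~mask] = visible_feats.reshape(-1, visible_feats.size(-1))
        return self.blocks(full)


def random_mask(B, N, ratio, device):
    """Boolean mask (True = masked) with the same number of masked tokens per sample."""
    n_mask = int(N * ratio)
    idx = torch.rand(B, N, device=device).argsort(dim=1)[:, :n_mask]
    mask = torch.zeros(B, N, dtype=torch.bool, device=device)
    mask.scatter_(1, idx, True)
    return mask


def point_cmae_step(tokens, encoder, dec_a, dec_b, mask_ratio=0.6):
    """One pre-training step on already-tokenized point patches of shape (B, N, C)."""
    B, N, C = tokens.shape
    # Mask the same token sequence twice to form a contrastive input pair.
    mask_a = random_mask(B, N, mask_ratio, tokens.device)
    mask_b = random_mask(B, N, mask_ratio, tokens.device)

    # The weight-sharing encoder processes only the visible tokens of each view.
    feats_a = encoder(tokens[~mask_a].view(B, -1, C))
    feats_b = encoder(tokens[~mask_b].view(B, -1, C))

    # Two identically structured decoders reconstruct the full sequences.
    rec_a = dec_a(feats_a, mask_a)
    rec_b = dec_b(feats_b, mask_b)

    # Generative objective: reconstruct tokens at masked positions
    # (plain MSE stands in for the point-patch reconstruction loss used in practice).
    loss_rec = F.mse_loss(rec_a[mask_a], tokens[mask_a]) + F.mse_loss(rec_b[mask_b], tokens[mask_b])

    # Explicit contrastive/consistency constraint: a token masked in *both* views
    # should be reconstructed to similar features by the two decoders.
    both = mask_a & mask_b
    if both.any():
        loss_cons = (1.0 - F.cosine_similarity(rec_a[both], rec_b[both], dim=-1)).mean()
    else:
        loss_cons = tokens.new_zeros(())
    return loss_rec + loss_cons


if __name__ == "__main__":
    # Toy usage: 2 "point clouds", 64 tokens each, 96-dim token features.
    enc, dec_a, dec_b = TinyEncoder(), TinyDecoder(), TinyDecoder()
    loss = point_cmae_step(torch.randn(2, 64, 96), enc, dec_a, dec_b)
    loss.backward()
```

The structural point the abstract makes is visible here: the contrastive signal comes for free from the positions masked in both views, so no extra augmentation pipeline or negative-pair bookkeeping is layered on top of the MAE recipe.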
Related papers
- Enhancing Vision-Language Model with Unmasked Token Alignment [37.12838142681491]
This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance their vision-language representations.
UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder.
arXiv Detail & Related papers (2024-05-29T11:48:17Z)
- MaeFuse: Transferring Omni Features with Pretrained Masked Autoencoders for Infrared and Visible Image Fusion via Guided Training [57.18758272617101]
MaeFuse is a novel autoencoder model designed for infrared and visible image fusion (IVIF)
Our model utilizes a pretrained encoder from Masked Autoencoders (MAE), which facilitates omni feature extraction for low-level reconstruction and high-level vision tasks.
MaeFuse not only introduces a novel perspective in the realm of fusion techniques but also stands out with impressive performance across various public datasets.
arXiv Detail & Related papers (2024-04-17T02:47:39Z)
- M$^3$CS: Multi-Target Masked Point Modeling with Learnable Codebook and Siamese Decoders [19.68592678093725]
Masked point modeling has become a promising scheme of self-supervised pre-training for point clouds.
M$^3$CS is proposed to equip the model with these abilities.
arXiv Detail & Related papers (2023-09-23T02:19:21Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- PointCMP: Contrastive Mask Prediction for Self-supervised Learning on Point Cloud Videos [58.18707835387484]
We propose a contrastive mask prediction framework for self-supervised learning on point cloud videos.
PointCMP employs a two-branch structure to achieve simultaneous learning of both local and global spatio-temporal information.
Our framework achieves state-of-the-art performance on benchmark datasets and outperforms existing fully-supervised counterparts.
arXiv Detail & Related papers (2023-05-06T15:47:48Z)
- Masked Contrastive Representation Learning [6.737710830712818]
This work presents Masked Contrastive Representation Learning (MACRL) for self-supervised visual pre-training.
We adopt an asymmetric setting for the siamese network (i.e., an encoder-decoder structure in both branches), where one branch uses a higher mask ratio and stronger data augmentation, while the other adopts weaker data corruptions.
In our experiments, MACRL presents superior results on various vision benchmarks, including CIFAR-10, CIFAR-100, Tiny-ImageNet, and two other ImageNet subsets.
arXiv Detail & Related papers (2022-11-11T05:32:28Z)
- i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable? [26.146459754995597]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain.
This paper aims to explore an interactive Masked Autoencoders (i-MAE) framework to enhance the representation capability.
In addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space.
arXiv Detail & Related papers (2022-10-20T17:59:54Z)
- Contrastive Masked Autoencoders are Stronger Vision Learners [114.16568579208216]
Contrastive Masked Autoencoders (CMAE) is a new self-supervised pre-training method for learning more comprehensive and capable vision representations.
CMAE achieves the state-of-the-art performance on highly competitive benchmarks of image classification, semantic segmentation and object detection.
arXiv Detail & Related papers (2022-07-27T14:04:22Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels (a toy sketch of this recipe appears after this entry).
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
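For contrast with the point-cloud sketch above, the following toy illustrates the image-domain MAE recipe summarized in the entry above: random patches are masked, only the visible patches are encoded, and a lightweight decoder predicts the missing pixels. It is a simplified illustration under stated assumptions (ToyMAE, its dimensions, and the MSE-on-raw-pixels target are inventions of this sketch; positional embeddings and per-patch normalization are omitted), not the released MAE code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def patchify(img, p=16):
    """Split (B, C, H, W) images into (B, N, p*p*C) non-overlapping patches."""
    B, C, H, W = img.shape
    x = img.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)


class ToyMAE(nn.Module):
    def __init__(self, patch_dim=16 * 16 * 3, dim=128):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, 4, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, 2)
        dec_layer = nn.TransformerEncoderLayer(dim, 4, dim * 4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, 1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_pixels = nn.Linear(dim, patch_dim)

    def forward(self, img, mask_ratio=0.75):
        patches = patchify(img)                                   # (B, N, patch_dim)
        B, N, _ = patches.shape
        n_keep = int(N * (1 - mask_ratio))
        order = torch.rand(B, N, device=img.device).argsort(dim=1)  # random shuffle per sample
        keep, drop = order[:, :n_keep], order[:, n_keep:]
        gather = lambda t, idx: torch.gather(
            t, 1, idx.unsqueeze(-1).expand(-1, -1, t.size(-1)))

        # Encode only the visible patches.
        latent = self.encoder(self.embed(gather(patches, keep)))
        # Append mask tokens for the dropped positions and decode the full set.
        mask_tokens = self.mask_token.expand(B, N - n_keep, -1)
        decoded = self.decoder(torch.cat([latent, mask_tokens], dim=1))
        pred = self.to_pixels(decoded[:, n_keep:])                 # predictions at masked slots
        target = gather(patches, drop)                             # missing pixels to reconstruct
        return F.mse_loss(pred, target)


if __name__ == "__main__":
    ToyMAE()(torch.randn(2, 3, 224, 224)).backward()
```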
This list is automatically generated from the titles and abstracts of the papers on this site.