CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets
- URL: http://arxiv.org/abs/2302.06148v1
- Date: Mon, 13 Feb 2023 07:09:45 GMT
- Title: CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets
- Authors: Jiange Yang, Sheng Guo, Gangshan Wu, Limin Wang
- Abstract summary: We present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed CoMAE.
Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
- Score: 50.6643933702394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current RGB-D scene recognition approaches often train two standalone
backbones for RGB and depth modalities with the same Places or ImageNet
pre-training. However, the pre-trained depth network is still biased by
RGB-based models, which may result in a suboptimal solution. In this paper, we
present a single-model self-supervised hybrid pre-training framework for RGB
and depth modalities, termed CoMAE. Our CoMAE presents a curriculum learning
strategy to unify the two popular self-supervised representation learning
algorithms: contrastive learning and masked image modeling. Specifically, we
first build a patch-level alignment task to pre-train a single encoder shared
by two modalities via cross-modal contrastive learning. Then, the pre-trained
contrastive encoder is passed to a multi-modal masked autoencoder to capture
finer context features from a generative perspective. In addition, our
single-model design, which requires no fusion module, is flexible and robust
enough to generalize to unimodal scenarios in both the training and testing phases.
Extensive experiments on SUN RGB-D and NYUDv2 datasets demonstrate the
effectiveness of our CoMAE for RGB and depth representation learning. In
addition, our experimental results reveal that CoMAE is a data-efficient
representation learner. Although we use only the small-scale, unlabeled
training set for pre-training, our CoMAE pre-trained models remain
competitive with state-of-the-art methods that rely on extra large-scale,
supervised RGB dataset pre-training. Code will be released at
https://github.com/MCG-NJU/CoMAE.
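For readers who want a concrete picture of the two-stage curriculum described above, the following is a minimal PyTorch sketch, not the authors' implementation (see the official repository linked above). The encoder size, patch size, temperature, the zeroing of masked patches instead of token dropping, and the exact construction of positive/negative pairs are all simplifying assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchEmbed(nn.Module):
    """Patch embedding shared by RGB and depth (depth assumed 3-channel, e.g. HHA)."""
    def __init__(self, patch=16, dim=768, in_ch=3):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                                  # x: (B, C, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)     # (B, N, dim)


class SharedEncoder(nn.Module):
    """Single ViT-style encoder used by both modalities, as in the abstract."""
    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        self.embed = PatchEmbed(dim=dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        return self.blocks(self.embed(x))                  # (B, N, dim)


def patch_contrastive_loss(rgb_tokens, depth_tokens, tau=0.07):
    """Stage 1: patch-level cross-modal alignment. The positive pair is the RGB
    patch and the depth patch at the same spatial location; all other patches
    in the batch serve as negatives (one plausible InfoNCE construction)."""
    B, N, D = rgb_tokens.shape
    q = F.normalize(rgb_tokens.reshape(B * N, D), dim=-1)
    k = F.normalize(depth_tokens.reshape(B * N, D), dim=-1)
    logits = q @ k.t() / tau                               # (B*N, B*N)
    targets = torch.arange(B * N, device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def patchify(imgs, patch=16):
    """(B, C, H, W) -> (B, N, C*patch*patch), row-major over the patch grid."""
    B, C, H, W = imgs.shape
    x = imgs.reshape(B, C, H // patch, patch, W // patch, patch)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(B, (H // patch) * (W // patch), -1)


def unpatchify(x, patch=16, channels=3):
    B, N, _ = x.shape
    g = int(N ** 0.5)                                      # assumes a square patch grid
    x = x.reshape(B, g, g, channels, patch, patch)
    return x.permute(0, 3, 1, 4, 2, 5).reshape(B, channels, g * patch, g * patch)


def mae_loss(encoder, decoder_head, imgs, mask_ratio=0.75, patch=16):
    """Stage 2 (simplified): reconstruct pixels of masked patches with the
    contrastively pre-trained encoder. A real masked autoencoder drops masked
    tokens before encoding; here they are simply zeroed to keep the sketch short."""
    target = patchify(imgs, patch)                         # (B, N, patch*patch*C)
    B, N, _ = target.shape
    mask = torch.rand(B, N, device=imgs.device) < mask_ratio
    visible = unpatchify(target * (~mask).unsqueeze(-1).float(), patch)
    pred = decoder_head(encoder(visible))                  # (B, N, patch*patch*C)
    return ((pred - target) ** 2)[mask].mean()


if __name__ == "__main__":
    encoder = SharedEncoder()
    decoder_head = nn.Linear(768, 16 * 16 * 3)             # lightweight pixel decoder
    rgb, depth = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
    # Curriculum, stage 1: align the two modalities in the shared encoder ...
    stage1 = patch_contrastive_loss(encoder(rgb), encoder(depth))
    # ... stage 2: continue with masked reconstruction on both modalities.
    stage2 = mae_loss(encoder, decoder_head, rgb) + mae_loss(encoder, decoder_head, depth)
    print(stage1.item(), stage2.item())
```

In this sketch the curriculum is simply sequential: the shared encoder is first trained with patch_contrastive_loss to align RGB and depth at the patch level, and the same encoder then continues training with the masked-reconstruction objective on both modalities.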
Related papers
- Consistent Multimodal Generation via A Unified GAN Framework [36.08519541540843]
We investigate how to generate multimodal image outputs, such as RGB, depth, and surface normals, with a single generative model.
Our solution builds on the StyleGAN3 architecture, with a shared backbone and modality-specific branches in the last layers of the synthesis network.
In experiments on the Stanford2D3D dataset, we demonstrate realistic and consistent generation of RGB, depth, and normal images.
arXiv Detail & Related papers (2023-07-04T01:33:20Z)
- FastMIM: Expediting Masked Image Modeling Pre-training for Vision [65.47756720190155]
FastMIM is a framework for pre-training vision backbones with low-resolution input images.
It reconstructs Histograms of Oriented Gradients (HOG) features instead of the original RGB values of the input images (a minimal sketch of this idea follows this entry).
It can achieve 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones.
arXiv Detail & Related papers (2022-12-13T14:09:32Z)
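As a rough illustration of the HOG-target idea summarized in the FastMIM entry above, the sketch below computes per-patch HOG descriptors that a masked-image-modeling decoder could regress instead of raw pixels. It is an approximation under assumed patch and cell sizes, not FastMIM's actual pipeline (which also lowers the input resolution); the helper hog_targets is hypothetical.

```python
# Illustrative only: per-patch HOG regression targets for a masked-image-modeling loss.
import numpy as np
import torch
from skimage.feature import hog  # standard scikit-image HOG descriptor


def hog_targets(gray_imgs, patch=16):
    """gray_imgs: (B, H, W) float array in [0, 1] -> (B, N, F) HOG target tensor."""
    B, H, W = gray_imgs.shape
    feats = []
    for img in gray_imgs:
        per_patch = [
            hog(img[i:i + patch, j:j + patch],
                orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
            for i in range(0, H, patch) for j in range(0, W, patch)
        ]
        feats.append(np.stack(per_patch))                  # (N, 36) for 16x16 patches
    return torch.from_numpy(np.stack(feats)).float()       # (B, N, 36)


# Training would then minimize, e.g., an MSE between the decoder's per-patch
# predictions and hog_targets(images) at the masked positions only.
```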
- Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require large amounts of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z)
- A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers [0.0]
We propose a recipe for transferring pretrained ViTs in RGB-D domains for single-view 3D object recognition.
We show that our adapted ViTs score up to 95.1% top-1 accuracy on the Washington benchmark, achieving new state-of-the-art results.
arXiv Detail & Related papers (2022-10-03T12:08:09Z)
- Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z)
- Self-Supervised Modality-Aware Multiple Granularity Pre-Training for RGB-Infrared Person Re-Identification [9.624510941236837]
Modality-Aware Multiple Granularity Learning (MMGL) is a self-supervised pre-training alternative to ImageNet pre-training.
MMGL learns better representations (+6.47% Rank-1) with faster training (converging in a few hours) and greater data efficiency (only 5% of the data size) than ImageNet pre-training.
Results suggest it generalizes well to various existing models and losses, and has promising transferability across datasets.
arXiv Detail & Related papers (2021-12-12T04:40:33Z)
- RGB-D Saliency Detection via Cascaded Mutual Information Minimization [122.8879596830581]
Existing RGB-D saliency detection models do not explicitly encourage RGB and depth to achieve effective multi-modal learning.
We introduce a novel multi-stage cascaded learning framework via mutual information minimization to "explicitly" model the multi-modal information between RGB image and depth data.
arXiv Detail & Related papers (2021-09-15T12:31:27Z)
- Self-Supervised Representation Learning for RGB-D Salient Object Detection [93.17479956795862]
We use self-supervised representation learning to design two pretext tasks: cross-modal auto-encoding and depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets for pre-training, which enables the network to capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z)