Self-Supervised Representation Learning for RGB-D Salient Object
Detection
- URL: http://arxiv.org/abs/2101.12482v1
- Date: Fri, 29 Jan 2021 09:16:06 GMT
- Title: Self-Supervised Representation Learning for RGB-D Salient Object
Detection
- Authors: Xiaoqi Zhao, Youwei Pang, Lihe Zhang, Huchuan Lu, Xiang Ruan
- Abstract summary: We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets for pre-training, which makes the network capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
- Score: 93.17479956795862
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing CNN-based RGB-D Salient Object Detection (SOD) networks all need to
be pre-trained on ImageNet to learn hierarchical features that provide a good
initialization. However, the collection and
annotation of large-scale datasets are time-consuming and expensive. In this
paper, we utilize Self-Supervised Representation Learning (SSL) to design two
pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets for
pre-training, which makes the network capture rich semantic contexts and
reduces the gap between the two modalities, thereby providing an effective
initialization for the downstream task. In addition, for the inherent problem
of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion (MPF) module
that splits a single feature fusion into multiple fusion paths to adequately
perceive both consistent and differential information. The MPF module
is general and suitable for both cross-modal and cross-level feature fusion.
Extensive experiments on six benchmark RGB-D SOD datasets show that our model,
pre-trained on an RGB-D dataset ($6,335$ images without any annotations), performs
favorably against most state-of-the-art RGB-D methods pre-trained on ImageNet
($1,280,000$ images with image-level annotations).
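
The two pretext tasks and the MPF module described in the abstract can be illustrated with short sketches. First, a minimal PyTorch sketch of the cross-modal auto-encoder idea: one branch reconstructs the paired depth map from the RGB image and another reconstructs the RGB image from depth, so pre-training needs only unlabeled RGB-D pairs. The network sizes and the mirrored decoder are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch of a cross-modal auto-encoder pretext task (not the paper's exact model).
import torch
import torch.nn as nn

class CrossModalAutoEncoder(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, width: int = 32):
        super().__init__()
        # Small convolutional encoder; a real model would use a deeper backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder predicts the *other* modality from the encoded features.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(width, out_ch, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

rgb2depth = CrossModalAutoEncoder(in_ch=3, out_ch=1)   # RGB -> depth reconstruction
depth2rgb = CrossModalAutoEncoder(in_ch=1, out_ch=3)   # depth -> RGB reconstruction
rgb, depth = torch.randn(2, 3, 224, 224), torch.randn(2, 1, 224, 224)
loss = nn.functional.l1_loss(rgb2depth(rgb), depth) + \
       nn.functional.l1_loss(depth2rgb(depth), rgb)    # no human annotations required
loss.backward()
```

Second, a hedged sketch of the multi-path fusion idea: instead of fusing two feature maps with a single operation, separate paths attend to consistent (shared) and differential (modality-specific) information before the paths are merged. The concrete path operations below (element-wise product, absolute difference, concatenation) are assumptions chosen for illustration, not the paper's exact MPF design.

```python
# Hypothetical multi-path fusion sketch (illustrative, not the paper's exact MPF module).
import torch
import torch.nn as nn

class MultiPathFusion(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.consistent = nn.Conv2d(ch, ch, 3, padding=1)    # path for cues the two inputs agree on
        self.differential = nn.Conv2d(ch, ch, 3, padding=1)  # path for complementary / differing cues
        self.joint = nn.Conv2d(2 * ch, ch, 3, padding=1)     # path over the full joint context
        self.merge = nn.Conv2d(3 * ch, ch, 1)                 # combine all paths into one feature map

    def forward(self, f_a, f_b):
        p1 = self.consistent(f_a * f_b)                       # consistent information
        p2 = self.differential(torch.abs(f_a - f_b))          # differential information
        p3 = self.joint(torch.cat([f_a, f_b], dim=1))         # joint information
        return self.merge(torch.cat([p1, p2, p3], dim=1))

# The same module can fuse cross-modal features (RGB vs. depth) or cross-level
# features (shallow vs. deep maps resized to a common resolution).
fused = MultiPathFusion(ch=64)(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56))
```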
Related papers
- PointMBF: A Multi-scale Bidirectional Fusion Network for Unsupervised
RGB-D Point Cloud Registration [6.030097207369754]
We propose a network implementing multi-scale bidirectional fusion between RGB images and point clouds generated from depth images.
Our method achieves new state-of-the-art performance.
arXiv Detail & Related papers (2023-08-09T08:13:46Z) - CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets [50.6643933702394]
We present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed as CoMAE.
Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
arXiv Detail & Related papers (2023-02-13T07:09:45Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient
Object Detection [67.33924278729903]
In this work, we propose a Dual Swin-Transformer based Mutual Interactive Network.
We adopt the Swin-Transformer as the feature extractor for both the RGB and depth modalities to model long-range dependencies in the visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object
Detection [15.371153771528093]
We propose a novel Mutual-Transformer Fusion Network (MTFNet) for RGB-D SOD.
MTFNet contains two main modules, $i.e.$, the Focal Feature Extractor (FFE) and the Mutual-Transformer Fusion (MTF).
Comprehensive experimental results on six public benchmarks demonstrate the superiority of our proposed MTFNet.
arXiv Detail & Related papers (2021-12-02T12:48:37Z) - RGB-D Saliency Detection via Cascaded Mutual Information Minimization [122.8879596830581]
Existing RGB-D saliency detection models do not explicitly encourage RGB and depth to achieve effective multi-modal learning.
We introduce a novel multi-stage cascaded learning framework via mutual information minimization to "explicitly" model the multi-modal information between RGB image and depth data.
arXiv Detail & Related papers (2021-09-15T12:31:27Z) - Siamese Network for RGB-D Salient Object Detection and Beyond [113.30063105890041]
A novel framework is proposed to learn from both RGB and depth inputs through a shared network backbone.
Comprehensive experiments using five popular metrics show that the designed framework yields a robust RGB-D saliency detector.
We also link JL-DCF to the RGB-D semantic segmentation field, showing its capability of outperforming several semantic segmentation models.
arXiv Detail & Related papers (2020-08-26T06:01:05Z) - Bi-directional Cross-Modality Feature Propagation with
Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels, and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient Cross-modality Guided Encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)