P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for
RGB-D Scene Understanding
- URL: http://arxiv.org/abs/2012.13089v1
- Date: Thu, 24 Dec 2020 04:00:52 GMT
- Title: P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for
RGB-D Scene Understanding
- Authors: Yunze Liu, Li Yi, Shanghang Zhang, Qingnan Fan, Thomas Funkhouser, Hao
Dong
- Abstract summary: We propose contrasting "pairs of point-pixel pairs", where positives include pairs of RGB-D points in correspondence, and negatives include pairs where one of the two modalities has been disturbed.
This provides extra flexibility in making hard negatives and helps networks to learn features from both modalities, not just the more discriminating one of the two.
- Score: 24.93545970229774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised representation learning is a critical problem in computer
vision, as it provides a way to pretrain feature extractors on large unlabeled
datasets that can be used as an initialization for more efficient and effective
training on downstream tasks. A promising approach is to use contrastive
learning to learn a latent space where features are close for similar data
samples and far apart for dissimilar ones. This approach has demonstrated
tremendous success for pretraining both image and point cloud feature
extractors, but it has been barely investigated for multi-modal RGB-D scans,
especially with the goal of facilitating high-level scene understanding. To
solve this problem, we propose contrasting "pairs of point-pixel pairs", where
positives include pairs of RGB-D points in correspondence, and negatives
include pairs where one of the two modalities has been disturbed and/or the two
RGB-D points are not in correspondence. This provides extra flexibility in
making hard negatives and helps networks to learn features from both
modalities, not just the more discriminating one of the two. Experiments show
that this proposed approach yields better performance on three large-scale
RGB-D scene understanding benchmarks (ScanNet, SUN RGB-D, and 3RScan) than
previous pretraining approaches.
Related papers
- SEMPose: A Single End-to-end Network for Multi-object Pose Estimation [13.131534219937533]
SEMPose is an end-to-end multi-object pose estimation network.
It can perform inference at 32 FPS without requiring inputs other than the RGB image.
It can accurately estimate the poses of multiple objects in real time, with inference time unaffected by the number of target objects.
arXiv Detail & Related papers (2024-11-21T10:37:54Z) - Towards Human-Level 3D Relative Pose Estimation: Generalizable, Training-Free, with Single Reference [62.99706119370521]
Humans can easily deduce the relative pose of an unseen object, without label/training, given only a single query-reference image pair.
We propose a novel 3D generalizable relative pose estimation method by elaborating (i) with a 2.5D shape from an RGB-D reference, (ii) with an off-the-shelf differentiable, and (iii) with semantic cues from a pretrained model like DINOv2.
arXiv Detail & Related papers (2024-06-26T16:01:10Z) - Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural Calibration [107.61458720202984]
This paper introduces a novel self-supervised learning framework for enhancing 3D perception in autonomous driving scenes.
We propose the learnable transformation alignment to bridge the domain gap between image and point cloud data.
We establish dense 2D-3D correspondences to estimate the rigid pose.
arXiv Detail & Related papers (2024-01-23T02:41:06Z) - PointMBF: A Multi-scale Bidirectional Fusion Network for Unsupervised
RGB-D Point Cloud Registration [6.030097207369754]
We propose a network implementing multi-scale bidirectional fusion between RGB images and point clouds generated from depth images.
Our method achieves new state-of-the-art performance.
arXiv Detail & Related papers (2023-08-09T08:13:46Z) - Masked Scene Contrast: A Scalable Framework for Unsupervised 3D
Representation Learning [37.155772047656114]
Masked Scene Contrast (MSC) framework is capable of extracting comprehensive 3D representations more efficiently and effectively.
MSC also enables large-scale 3D pre-training across multiple datasets.
arXiv Detail & Related papers (2023-03-24T17:59:58Z) - DPODv2: Dense Correspondence-Based 6 DoF Pose Estimation [24.770767430749288]
We propose a 3 stage 6 DoF object detection method called DPODv2 (Dense Pose Object Detector)
We combine a 2D object detector with a dense correspondence estimation network and a multi-view pose refinement method to estimate a full 6 DoF pose.
DPODv2 achieves excellent results on all of them while still remaining fast and scalable independent of the used data modality and the type of training data.
arXiv Detail & Related papers (2022-07-06T16:48:56Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient
Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - Self-Supervised Representation Learning for RGB-D Salient Object
Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a few and un RGB-D datasets to perform pre-training, which make the network capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z) - Accurate RGB-D Salient Object Detection via Collaborative Learning [101.82654054191443]
RGB-D saliency detection shows impressive ability on some challenge scenarios.
We propose a novel collaborative learning framework where edge, depth and saliency are leveraged in a more efficient way.
arXiv Detail & Related papers (2020-07-23T04:33:36Z) - Bi-directional Cross-Modality Feature Propagation with
Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and models the problem as a cross-modal feature fusion.
In this paper, we propose a unified and efficient Crossmodality Guided to not only effectively recalibrate RGB feature responses, but also to distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternatively.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.