Leveraging the Third Dimension in Contrastive Learning
- URL: http://arxiv.org/abs/2301.11790v1
- Date: Fri, 27 Jan 2023 15:45:03 GMT
- Title: Leveraging the Third Dimension in Contrastive Learning
- Authors: Sumukh Aithal, Anirudh Goyal, Alex Lamb, Yoshua Bengio, Michael Mozer
- Abstract summary: Self-Supervised Learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks.
Most such methods rely on 2D image augmentations, which ignore the fact that biological vision takes place in an immersive three-dimensional, temporally contiguous environment.
We explore two distinct approaches to incorporating depth signals into the SSL framework.
- Score: 88.17394309208925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-Supervised Learning (SSL) methods operate on unlabeled data to learn
robust representations useful for downstream tasks. Most SSL methods rely on
augmentations obtained by transforming the 2D image pixel map. These
augmentations ignore the fact that biological vision takes place in an
immersive three-dimensional, temporally contiguous environment, and that
low-level biological vision relies heavily on depth cues. Using a signal
provided by a pretrained state-of-the-art monocular RGB-to-depth model (the
Depth Prediction Transformer; Ranftl et al., 2021), we explore two
distinct approaches to incorporating depth signals into the SSL framework.
First, we evaluate contrastive learning using an RGB+depth input
representation. Second, we use the depth signal to generate novel views from
slightly different camera positions, thereby producing a 3D augmentation for
contrastive learning. We evaluate these two approaches on three different SSL
methods -- BYOL, SimSiam, and SwAV -- using the ImageNette (a 10-class subset of
ImageNet), ImageNet-100, and ImageNet-1k datasets. We find that both approaches
to incorporating depth signals improve the robustness and generalization of the
baseline SSL methods, though the first approach (with depth-channel
concatenation) is superior. For instance, BYOL with the additional depth
channel improves downstream classification accuracy from 85.3%
to 88.0% on ImageNette and from 84.1% to 87.0% on ImageNet-C.
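The two approaches lend themselves to short sketches. The first block below is a minimal, hedged sketch of the depth-concatenation approach, not the authors' released code: it predicts depth with a pretrained DPT model, appends it as a fourth input channel, and widens the encoder stem accordingly. The checkpoint name "Intel/dpt-large" is one public DPT release, and the stem-widening scheme is a common convention; both are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF
from PIL import Image
from torchvision.models import resnet50
from transformers import DPTImageProcessor, DPTForDepthEstimation

# Pretrained monocular depth estimator (DPT, Ranftl et al., 2021).
# "Intel/dpt-large" is a public checkpoint, assumed here for illustration.
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
depth_model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large").eval()

@torch.no_grad()
def rgbd(image: Image.Image) -> torch.Tensor:
    """Return a [4, H, W] tensor: RGB plus a normalized predicted-depth channel."""
    rgb = TF.to_tensor(image)                               # [3, H, W], values in [0, 1]
    inputs = processor(images=image, return_tensors="pt")
    depth = depth_model(**inputs).predicted_depth           # [1, h, w]
    depth = F.interpolate(depth.unsqueeze(1), size=rgb.shape[-2:],
                          mode="bicubic", align_corners=False)[0]
    depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    return torch.cat([rgb, depth], dim=0)

# Widen the backbone stem from 3 to 4 input channels so a standard SSL
# encoder (e.g., the ResNet-50 used by BYOL/SimSiam/SwAV) accepts RGB+D.
encoder = resnet50()
old_conv = encoder.conv1
encoder.conv1 = torch.nn.Conv2d(4, old_conv.out_channels,
                                kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    encoder.conv1.weight[:, :3] = old_conv.weight           # keep existing RGB filters
    encoder.conv1.weight[:, 3:] = old_conv.weight.mean(dim=1, keepdim=True)
```

The second approach, depth-based novel-view augmentation, can be approximated with a disparity-style backward warp: pixels are shifted in proportion to inverse depth, simulating a small sideways camera translation. This warp is an assumption made for illustration; the abstract does not specify the paper's actual view-synthesis procedure.

```python
import torch
import torch.nn.functional as F

def novel_view(rgb: torch.Tensor, depth: torch.Tensor,
               shift: float = 0.05) -> torch.Tensor:
    """Resample a [3, H, W] image with horizontal disparity proportional to
    inverse depth, approximating a slightly translated camera."""
    _, h, w = rgb.shape
    inv = 1.0 / depth.clamp(min=1e-3)                       # nearer pixels move more
    inv = (inv - inv.min()) / (inv.max() - inv.min() + 1e-8)
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack([xs + shift * inv[0], ys], dim=-1)   # [H, W, 2]
    return F.grid_sample(rgb.unsqueeze(0), grid.unsqueeze(0), mode="bilinear",
                         padding_mode="border", align_corners=True)[0]
```

Either output (the 4-channel tensor or the warped view) can then be passed through the usual 2D augmentation pipeline of the chosen SSL method.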
Related papers
- De-coupling and De-positioning Dense Self-supervised Learning [65.56679416475943]
Dense Self-Supervised Learning (SSL) methods address the limitations of using image-level feature representations when handling images with multiple objects.
We show that they suffer from coupling and positional bias, which arise from the receptive field increasing with layer depth and zero-padding.
We demonstrate the benefits of our method on COCO and on a new challenging benchmark, OpenImage-MINI, for object classification, semantic segmentation, and object detection.
arXiv Detail & Related papers (2023-03-29T18:07:25Z)
- LFM-3D: Learnable Feature Matching Across Wide Baselines Using 3D Signals [9.201550006194994]
Learnable matchers often underperform when there exist only small regions of co-visibility between image pairs.
We propose LFM-3D, a Learnable Feature Matching framework that uses models based on graph neural networks.
We show that the resulting improved correspondences lead to much higher relative posing accuracy for in-the-wild image pairs.
arXiv Detail & Related papers (2023-03-22T17:46:27Z)
- Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders [52.91248611338202]
We propose Image-to-Point Masked Autoencoders (I2P-MAE), an alternative way to obtain superior 3D representations from 2D pre-trained models.
Through self-supervised pre-training, we leverage the well-learned 2D knowledge to guide 3D masked autoencoding.
I2P-MAE attains state-of-the-art 90.11% accuracy, 3.68% above the second best, demonstrating superior transfer capability.
arXiv Detail & Related papers (2022-12-13T17:59:20Z)
- GraphCSPN: Geometry-Aware Depth Completion via Dynamic GCNs [49.55919802779889]
We propose a Graph Convolution based Spatial Propagation Network (GraphCSPN) as a general approach for depth completion.
In this work, we leverage convolutional neural networks as well as graph neural networks in a complementary way for geometric representation learning.
Our method achieves state-of-the-art performance, especially when only a few propagation steps are used.
arXiv Detail & Related papers (2022-10-19T17:56:03Z)
- Offline Visual Representation Learning for Embodied Navigation [50.442660137987275]
The approach first pretrains visual representations offline with self-supervised learning (SSL), then finetunes visuomotor representations online on specific tasks with image augmentations under long learning schedules.
arXiv Detail & Related papers (2022-04-27T23:22:43Z)
- Unified Contrastive Learning in Image-Text-Label Space [130.31947133453406]
Unified Contrastive Learning (UniCL) is an effective way of learning semantically rich yet discriminative representations.
UniCL stand-alone is a good learner on pure image-label data, rivaling supervised learning methods across three image classification datasets.
arXiv Detail & Related papers (2022-04-07T17:34:51Z)
- Pri3D: Can 3D Priors Help 2D Representation Learning? [37.35721274841419]
We introduce an approach to learn view-invariant, geometry-aware representations for network pre-training.
We employ contrastive learning under both multi-view image constraints and image-geometry constraints to encode 3D priors into learned 2D representations.
arXiv Detail & Related papers (2021-04-22T17:59:30Z)
- Towards Dense People Detection with Deep Learning and Depth images [9.376814409561726]
This paper proposes a DNN-based system that detects multiple people from a single depth image.
Our neural network processes a depth image and outputs a likelihood map in image coordinates.
We show this strategy to be effective, producing networks that generalize to work with scenes different from those used during training.
arXiv Detail & Related papers (2020-07-14T16:43:02Z)
- DELTAS: Depth Estimation by Learning Triangulation And densification of Sparse points [14.254472131009653]
Multi-view stereo (MVS) is the golden mean between the accuracy of active depth sensing and the practicality of monocular depth estimation.
Cost volume based approaches employing 3D convolutional neural networks (CNNs) have considerably improved the accuracy of MVS systems.
We propose an efficient depth estimation approach by first (a) detecting and evaluating descriptors for interest points, then (b) learning to match and triangulate a small set of interest points, and finally (c) densifying this sparse set of 3D points using CNNs (see the sketch after this entry).
arXiv Detail & Related papers (2020-03-19T17:56:41Z)
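As a structural aid, here is a hypothetical skeleton of the three-stage DELTAS pipeline described in the entry above. The module names and interfaces (detector, matcher, densifier) are placeholders invented for illustration, not the authors' API.

```python
import torch.nn as nn

class DELTASSketch(nn.Module):
    """Hedged sketch of detect -> match/triangulate -> densify; each
    submodule stands in for a learned network from the paper."""
    def __init__(self, detector: nn.Module, matcher: nn.Module, densifier: nn.Module):
        super().__init__()
        self.detector = detector      # (a) interest points and descriptors
        self.matcher = matcher        # (b) match across views and triangulate
        self.densifier = densifier    # (c) CNN densifies sparse 3D points into depth

    def forward(self, ref_img, src_imgs, poses):
        points, descriptors = self.detector(ref_img)
        sparse_depth = self.matcher(points, descriptors, src_imgs, poses)
        return self.densifier(ref_img, sparse_depth)
```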
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.