Self-supervised Feature Learning by Cross-modality and Cross-view
Correspondences
- URL: http://arxiv.org/abs/2004.05749v1
- Date: Mon, 13 Apr 2020 02:57:25 GMT
- Title: Self-supervised Feature Learning by Cross-modality and Cross-view
Correspondences
- Authors: Longlong Jing, Yucheng Chen, Ling Zhang, Mingyi He, Yingli Tian
- Abstract summary: This paper presents a novel self-supervised learning approach to learn both 2D image features and 3D point cloud features.
It exploits cross-modality and cross-view correspondences without using any annotated human labels.
The effectiveness of the learned 2D and 3D features is evaluated by transferring them to five different tasks.
- Score: 32.01548991331616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of supervised learning requires large-scale ground-truth labels,
which are very expensive, time-consuming, or may require special skills to annotate. To
address this issue, many self-supervised or unsupervised methods have been developed.
Unlike most existing self-supervised methods that learn only 2D image features or only 3D
point cloud features, this paper presents a novel and effective self-supervised learning
approach to jointly learn both 2D image features and 3D point cloud features by exploiting
cross-modality and cross-view correspondences, without using any human-annotated labels.
Specifically, 2D image features of rendered images from different views are extracted by a
2D convolutional neural network, and 3D point cloud features are extracted by a graph
convolutional neural network. The two types of features are fed into a two-layer fully
connected neural network to estimate the cross-modality correspondence. The three networks
are jointly trained by verifying whether two sampled data items of different modalities
belong to the same object (i.e., cross-modality); meanwhile, the 2D convolutional neural
network is additionally optimized by minimizing the intra-object distance and maximizing
the inter-object distance of rendered images from different views (i.e., cross-view). The
effectiveness of the learned 2D and 3D features is evaluated by transferring them to five
different tasks: multi-view 2D shape recognition, 3D shape recognition, multi-view 2D shape
retrieval, 3D shape retrieval, and 3D part segmentation. Extensive evaluations on all five
tasks across different datasets demonstrate the strong generalization ability and
effectiveness of the 2D and 3D features learned by the proposed self-supervised method.
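The abstract describes two training signals: a cross-modality correspondence classifier over paired 2D/3D features, and a cross-view metric objective on features of rendered views. Below is a minimal PyTorch-style sketch of those two losses; the feature dimension, margin, encoder names, and pair-sampling scheme are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the two objectives described above (assumptions noted).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrespondenceHead(nn.Module):
    """Two-layer fully connected network that predicts whether a 2D image
    feature and a 3D point cloud feature belong to the same object."""
    def __init__(self, dim=256):  # feature dimension is an assumption
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 2),
        )

    def forward(self, feat_2d, feat_3d):
        # Concatenate the two modalities and output same/different logits.
        return self.fc(torch.cat([feat_2d, feat_3d], dim=-1))

def cross_modality_loss(head, feat_2d, feat_3d, same_object):
    """Binary classification over sampled (image, point cloud) feature pairs."""
    logits = head(feat_2d, feat_3d)
    return F.cross_entropy(logits, same_object.long())

def cross_view_loss(anchor, positive, negative, margin=1.0):
    """Triplet-style metric loss on 2D features of rendered views: pull views
    of the same object together, push views of different objects apart."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Hypothetical training step (encoder names are placeholders):
#   feat_2d = cnn_2d(rendered_views)   # 2D CNN on rendered images
#   feat_3d = gcn_3d(point_clouds)     # graph CNN on point clouds
#   loss = cross_modality_loss(head, feat_2d, feat_3d, labels) \
#        + cross_view_loss(view_a, view_b, view_of_other_object)
```

The margin-based cross-view term mirrors the intra-/inter-object distance objective stated in the abstract; a contrastive formulation could serve the same role.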
Related papers
- OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation [67.56268991234371]
OV-Uni3DETR achieves state-of-the-art performance across various scenarios, surpassing existing methods by more than 6% on average.
Code and pre-trained models will be released later.
arXiv Detail & Related papers (2024-03-28T17:05:04Z)
- Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural Calibration [107.61458720202984]
This paper introduces a novel self-supervised learning framework for enhancing 3D perception in autonomous driving scenes.
We propose the learnable transformation alignment to bridge the domain gap between image and point cloud data.
We establish dense 2D-3D correspondences to estimate the rigid pose.
arXiv Detail & Related papers (2024-01-23T02:41:06Z)
- Multi-View Representation is What You Need for Point-Cloud Pre-Training [22.55455166875263]
This paper proposes a novel approach to point-cloud pre-training that learns 3D representations by leveraging pre-trained 2D networks.
We train the 3D feature extraction network with the help of the novel 2D knowledge transfer loss.
Experimental results demonstrate that our pre-trained model can be successfully transferred to various downstream tasks.
arXiv Detail & Related papers (2023-06-05T03:14:54Z)
- Towards Deeper and Better Multi-view Feature Fusion for 3D Semantic Segmentation [17.557697146752652]
2D & 3D semantic segmentation has become mainstream in 3D scene understanding.
However, how to fuse and process the cross-dimensional features from these two distinct spaces remains elusive.
In this paper, we argue that, despite its simplicity, unidirectionally projecting multi-view 2D deep semantic features into the 3D space, aligned with 3D deep semantic features, could lead to better feature fusion.
arXiv Detail & Related papers (2022-12-13T15:58:25Z)
- MvDeCor: Multi-view Dense Correspondence Learning for Fine-grained 3D Segmentation [91.6658845016214]
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks.
We render a 3D shape from multiple views, and set up a dense correspondence learning task within the contrastive learning framework.
As a result, the learned 2D representations are view-invariant and geometrically consistent.
arXiv Detail & Related papers (2022-08-18T00:48:15Z)
- PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student.
By pair-wise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without exhaustive and complicated network modification.
arXiv Detail & Related papers (2022-07-07T07:23:20Z)
- CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding [2.8661021832561757]
CrossPoint is a simple cross-modal contrastive learning approach to learn transferable 3D point cloud representations.
Our approach outperforms the previous unsupervised learning methods on a diverse range of downstream tasks including 3D object classification and segmentation.
arXiv Detail & Related papers (2022-03-01T18:59:01Z)
- Learning from 2D: Pixel-to-Point Knowledge Transfer for 3D Pretraining [21.878815180924832]
We present a novel 3D pretraining method by leveraging 2D networks learned from rich 2D datasets.
Our experiments show that the 3D models pretrained with 2D knowledge boost the performances across various real-world 3D downstream tasks.
arXiv Detail & Related papers (2021-04-10T05:40:42Z)
- Learning Joint 2D-3D Representations for Depth Completion [90.62843376586216]
We design a simple yet effective neural network block that learns to extract joint 2D and 3D features.
Specifically, the block consists of two domain-specific sub-networks that apply 2D convolution on image pixels and continuous convolution on 3D points.
arXiv Detail & Related papers (2020-12-22T22:58:29Z)
- Learning 2D-3D Correspondences To Solve The Blind Perspective-n-Point Problem [98.92148855291363]
This paper proposes a deep CNN model which simultaneously solves for both the 6-DoF absolute camera pose and 2D-3D correspondences.
Tests on both real and simulated data have shown that our method substantially outperforms existing approaches.
arXiv Detail & Related papers (2020-03-15T04:17:30Z)
- Pointwise Attention-Based Atrous Convolutional Neural Networks [15.499267533387039]
A pointwise attention-based atrous convolutional neural network architecture is proposed to efficiently deal with a large number of points.
The proposed model has been evaluated on the two most important 3D point cloud datasets for the 3D semantic segmentation task.
It achieves reasonable performance in terms of accuracy compared to state-of-the-art models, with a much smaller number of parameters.
arXiv Detail & Related papers (2019-12-27T13:12:58Z)
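Several of the related papers above (e.g., CrossPoint, PointMCD, Learning from 2D) share the idea of aligning image embeddings with point-cloud embeddings of the same object. The sketch below shows a generic symmetric cross-modal contrastive (InfoNCE) objective for that purpose; it illustrates the shared idea only and is not the exact loss of any listed paper, and the function and parameter names are assumptions.

```python
# Generic symmetric cross-modal contrastive (InfoNCE) loss between image and
# point cloud embeddings; an illustrative sketch, not any listed paper's loss.
import torch
import torch.nn.functional as F

def cross_modal_info_nce(img_emb, pc_emb, temperature=0.07):
    """img_emb, pc_emb: (N, D) tensors where row i of each comes from the
    same object; all other rows in the batch act as negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    pc_emb = F.normalize(pc_emb, dim=-1)
    logits = img_emb @ pc_emb.t() / temperature   # (N, N) cosine similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Average the image-to-point and point-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```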
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.