Self-Supervised Image-to-Point Distillation via Semantically Tolerant
Contrastive Loss
- URL: http://arxiv.org/abs/2301.05709v2
- Date: Fri, 24 Mar 2023 15:53:21 GMT
- Title: Self-Supervised Image-to-Point Distillation via Semantically Tolerant
Contrastive Loss
- Authors: Anas Mahmoud, Jordan S. K. Hu, Tianshu Kuai, Ali Harakeh, Liam Paull,
and Steven L. Waslander
- Abstract summary: We propose a novel semantically tolerant image-to-point contrastive loss that takes into consideration the semantic distance between positive and negative image regions.
Our method consistently outperforms state-of-the-art 2D-to-3D representation learning frameworks across a wide range of 2D self-supervised pretrained models.
- Score: 18.485918870427327
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An effective framework for learning 3D representations for perception tasks
is distilling rich self-supervised image features via contrastive learning.
However, image-to-point representation learning for autonomous driving datasets
faces two main challenges: 1) the abundance of self-similarity, which results
in the contrastive losses pushing away semantically similar point and image
regions and thus disturbing the local semantic structure of the learned
representations, and 2) severe class imbalance as pretraining gets dominated by
over-represented classes. We propose to alleviate the self-similarity problem
through a novel semantically tolerant image-to-point contrastive loss that
takes into consideration the semantic distance between positive and negative
image regions to minimize contrasting semantically similar point and image
regions. Additionally, we address class imbalance by designing a class-agnostic
balanced loss that approximates the degree of class imbalance through an
aggregate sample-to-samples semantic similarity measure. We demonstrate that
our semantically tolerant contrastive loss with class balancing improves
state-of-the-art 2D-to-3D representation learning in all evaluation settings on
3D semantic segmentation. Our method consistently outperforms state-of-the-art
2D-to-3D representation learning frameworks across a wide range of 2D
self-supervised pretrained models.
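As a rough illustration of the two ideas in the abstract, the sketch below (not the authors' code; the function name, weighting forms, and hyperparameters are assumptions) down-weights negatives that are semantically close to the anchor's matched image region, and re-weights anchors whose positive region resembles many other regions in the batch as a proxy for over-represented classes.

```python
# Illustrative only: an InfoNCE-style point-to-image loss with semantic tolerance on
# negatives and a class-agnostic balancing weight on anchors. Names and the exact
# weighting functions are assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F


def semantically_tolerant_loss(point_feats, img_feats, tau=0.07):
    """point_feats: (N, D) point-region features; img_feats: (N, D) features of the
    matching image regions (row i of each tensor forms a positive pair)."""
    p = F.normalize(point_feats, dim=1)
    q = F.normalize(img_feats, dim=1)

    logits = p @ q.t() / tau                 # (N, N) point-to-image similarities
    sem_sim = (q @ q.t()).clamp(min=0.0)     # semantic similarity between image regions

    # Semantic tolerance: negatives that resemble the positive image region are
    # down-weighted, so the loss does not push away semantically similar regions.
    neg_weight = 1.0 - sem_sim
    neg_weight.fill_diagonal_(1.0)           # keep the positive term at full weight

    exp_logits = torch.exp(logits) * neg_weight
    log_prob = logits.diag() - torch.log(exp_logits.sum(dim=1))

    # Class-agnostic balancing: an anchor whose image region is similar to many other
    # regions in the batch is likely from an over-represented class, so weight it down.
    agg_sim = (sem_sim.sum(dim=1) - 1.0) / (sem_sim.size(0) - 1)
    anchor_weight = (1.0 - agg_sim).clamp(min=1e-6)
    anchor_weight = anchor_weight / anchor_weight.sum()

    return -(anchor_weight * log_prob).sum()
```

In practice the semantic similarities would come from the frozen 2D self-supervised features, and the weighting could use a temperature or threshold instead of the linear form shown here.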
Related papers
- Image-to-Lidar Relational Distillation for Autonomous Driving Data [4.893568782260855]
2D foundation models excel at addressing 2D tasks with little or no downstream supervision, owing to their robust representations.
The emergence of 2D-to-3D distillation frameworks has extended these capabilities to 3D models.
However, distilling 3D representations for autonomous driving datasets presents challenges such as self-similarity, class imbalance, and point cloud sparsity.
We propose a relational distillation framework enforcing intra-modal and cross-modal constraints, resulting in distilled 3D representations that closely capture the structure of the 2D representation (see the sketch after this entry).
arXiv Detail & Related papers (2024-09-01T21:26:32Z)
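A minimal sketch of what a relational (intra-modal plus cross-modal) distillation objective could look like, assuming pairwise-similarity matching with a KL divergence; this is an illustration of the idea, not the paper's implementation, and all names are made up.

```python
# Illustrative only: match the pairwise similarity structure of distilled 3D features
# to that of a frozen 2D teacher, both within a modality (intra-modal) and across
# modalities (cross-modal).
import torch
import torch.nn.functional as F


def relational_distillation_loss(feats_3d, feats_2d, tau=0.1):
    """feats_3d: (N, D) student point features; feats_2d: (N, D) frozen 2D teacher
    features of the corresponding image regions."""
    s3 = F.normalize(feats_3d, dim=1)
    s2 = F.normalize(feats_2d, dim=1)

    # Teacher relations: how each 2D feature relates to all other 2D features.
    rel_2d = F.softmax(s2 @ s2.t() / tau, dim=1)

    # Intra-modal constraint: 3D-to-3D relations should mirror 2D-to-2D relations.
    rel_3d = F.log_softmax(s3 @ s3.t() / tau, dim=1)
    intra = F.kl_div(rel_3d, rel_2d, reduction="batchmean")

    # Cross-modal constraint: 3D-to-2D relations should also mirror 2D-to-2D relations.
    rel_cross = F.log_softmax(s3 @ s2.t() / tau, dim=1)
    cross = F.kl_div(rel_cross, rel_2d, reduction="batchmean")

    return intra + cross
```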
- Exploring the Untouched Sweeps for Conflict-Aware 3D Segmentation Pretraining [41.145598142457686]
LiDAR-camera 3D representation pretraining has shown significant promise for 3D perception tasks and related applications.
We propose a novel Vision-Foundation-Model-driven sample exploring module to meticulously select LiDAR-Image pairs from unexplored frames.
Our method consistently outperforms existing state-of-the-art pretraining frameworks across three major public autonomous driving datasets.
arXiv Detail & Related papers (2024-07-10T08:46:29Z)
- Fine-grained Image-to-LiDAR Contrastive Distillation with Visual Foundation Models [55.99654128127689]
Visual Foundation Models (VFMs) are used to enhance 3D representation learning.
VFMs generate semantic labels for weakly-supervised pixel-to-point contrastive distillation.
We adapt the sampling probabilities of points to address imbalances in spatial distribution and category frequency (see the sketch after this entry).
arXiv Detail & Related papers (2024-05-23T07:48:19Z)
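One plausible way to adapt per-point sampling probabilities, sketched under assumptions (the density estimate, the exponents, and all names are placeholders rather than the paper's method): points from rare pseudo-categories and sparsely populated regions are drawn more often.

```python
# Illustrative only: re-weight how often each point is sampled for pixel-to-point
# contrastive distillation using VFM-generated pseudo-labels and a local density estimate.
import torch


def point_sampling_probs(pseudo_labels, point_density, alpha=1.0, beta=1.0):
    """pseudo_labels: (N,) integer semantic label per point (e.g. from a VFM);
    point_density: (N,) local point density (e.g. neighbour count within a radius)."""
    counts = torch.bincount(pseudo_labels)
    class_freq = counts[pseudo_labels].float()      # frequency of each point's category

    # Rare categories and sparse regions receive larger sampling weights.
    weights = class_freq.pow(-alpha) * point_density.float().clamp(min=1.0).pow(-beta)
    return weights / weights.sum()


# Usage: draw a balanced subset of points for the contrastive loss.
# probs = point_sampling_probs(pseudo_labels, density)
# idx = torch.multinomial(probs, num_samples=4096, replacement=False)
```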
- Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural Calibration [107.61458720202984]
This paper introduces a novel self-supervised learning framework for enhancing 3D perception in autonomous driving scenes.
We propose the learnable transformation alignment to bridge the domain gap between image and point cloud data.
We establish dense 2D-3D correspondences to estimate the rigid pose.
arXiv Detail & Related papers (2024-01-23T02:41:06Z)
- Unsupervised Feature Clustering Improves Contrastive Representation Learning for Medical Image Segmentation [18.75543045234889]
Self-supervised instance discrimination is an effective contrastive pretext task to learn feature representations and address limited medical image annotations.
We propose a new self-supervised contrastive learning method that uses unsupervised feature clustering to better select positive and negative image samples.
Our method outperforms state-of-the-art self-supervised contrastive techniques on medical image segmentation tasks (see the sketch after this entry).
arXiv Detail & Related papers (2022-11-15T22:54:29Z)
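A hedged sketch of cluster-guided positive/negative selection for instance discrimination, assuming unsupervised (e.g. k-means) cluster assignments are available; same-cluster samples are simply dropped from the negative set. This is illustrative only, not the paper's code, and all names are assumptions.

```python
# Illustrative only: InfoNCE over two augmented views per image, where samples in the
# anchor's cluster are excluded from the negatives so semantically similar images are
# not pushed apart.
import torch
import torch.nn.functional as F


def cluster_guided_infonce(feats, cluster_ids, tau=0.1):
    """feats: (2B, D) embeddings with the two views of image b at rows 2b and 2b+1;
    cluster_ids: (2B,) unsupervised cluster assignment per embedding."""
    z = F.normalize(feats, dim=1)
    logits = z @ z.t() / tau
    n = z.size(0)

    idx = torch.arange(n, device=z.device)
    pos_idx = idx ^ 1                             # the other view of the same image

    same_cluster = cluster_ids.unsqueeze(0) == cluster_ids.unsqueeze(1)
    keep = torch.ones_like(logits, dtype=torch.bool)
    keep.fill_diagonal_(False)                    # never contrast a sample with itself
    keep &= ~same_cluster                         # drop same-cluster samples from negatives
    keep[idx, pos_idx] = True                     # always keep the true positive

    exp_logits = torch.exp(logits) * keep
    log_prob = logits[idx, pos_idx] - torch.log(exp_logits.sum(dim=1))
    return -log_prob.mean()
```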
- RiCS: A 2D Self-Occlusion Map for Harmonizing Volumetric Objects [68.85305626324694]
Ray-marching in Camera Space (RiCS) is a new method that represents the self-occlusions of 3D foreground objects as a 2D self-occlusion map.
We show that our representation map not only allows us to enhance the image quality but also to model temporally coherent complex shadow effects.
arXiv Detail & Related papers (2022-05-14T05:35:35Z)
- Non-Local Latent Relation Distillation for Self-Adaptive 3D Human Pose Estimation [63.199549837604444]
3D human pose estimation approaches leverage different forms of strong (2D/3D pose) or weak (multi-view or depth) paired supervision.
We cast 3D pose learning as a self-supervised adaptation problem that aims to transfer the task knowledge from a labeled source domain to a completely unpaired target.
We evaluate different self-adaptation settings and demonstrate state-of-the-art 3D human pose estimation performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-05T03:52:57Z)
- Self-Supervised Image Representation Learning with Geometric Set Consistency [50.12720780102395]
We propose a method for self-supervised image representation learning under the guidance of 3D geometric consistency.
Specifically, we introduce 3D geometric consistency into a contrastive learning framework to enforce the feature consistency within image views.
arXiv Detail & Related papers (2022-03-29T08:57:33Z)
- Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets new state-of-the-art results in all of these settings, demonstrating its efficacy and generalizability.
arXiv Detail & Related papers (2020-07-03T21:53:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.