Masked Scene Contrast: A Scalable Framework for Unsupervised 3D
Representation Learning
- URL: http://arxiv.org/abs/2303.14191v1
- Date: Fri, 24 Mar 2023 17:59:58 GMT
- Title: Masked Scene Contrast: A Scalable Framework for Unsupervised 3D
Representation Learning
- Authors: Xiaoyang Wu, Xin Wen, Xihui Liu, Hengshuang Zhao
- Abstract summary: Masked Scene Contrast (MSC) framework is capable of extracting comprehensive 3D representations more efficiently and effectively.
MSC also enables large-scale 3D pre-training across multiple datasets.
- Score: 37.155772047656114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a pioneering work, PointContrast conducts unsupervised 3D representation
learning via leveraging contrastive learning over raw RGB-D frames and proves
its effectiveness on various downstream tasks. However, the trend of
large-scale unsupervised learning in 3D has yet to emerge due to two stumbling
blocks: the inefficiency of matching RGB-D frames as contrastive views and the
annoying mode collapse phenomenon mentioned in previous works. Turning the two
stumbling blocks into empirical stepping stones, we first propose an efficient
and effective contrastive learning framework, which generates contrastive views
directly on scene-level point clouds by a well-curated data augmentation
pipeline and a practical view mixing strategy. Second, we introduce
reconstructive learning on the contrastive learning framework with an exquisite
design of contrastive cross masks, which targets the reconstruction of point
color and surfel normal. Our Masked Scene Contrast (MSC) framework is capable
of extracting comprehensive 3D representations more efficiently and
effectively. It accelerates the pre-training procedure by at least 3x and still
achieves an uncompromised performance compared with previous work. Besides, MSC
also enables large-scale 3D pre-training across multiple datasets, which
further boosts the performance and achieves state-of-the-art fine-tuning
results on several downstream tasks, e.g., 75.5% mIoU on ScanNet semantic
segmentation validation set.
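To make the abstract's description concrete, below is a minimal sketch of the kind of objective it outlines: two augmented views of one scene-level point cloud, complementary ("cross") masks, a point-wise InfoNCE contrastive loss on matched points, and reconstruction of point color and surfel normal on masked points. All function names, shapes, the toy encoder, and the loss weighting are illustrative assumptions, not the authors' implementation.

```python
import math
import torch
import torch.nn.functional as F

def augment(xyz, rgb):
    """Toy view generation: random z-rotation plus color jitter, standing in for the
    curated augmentation pipeline and view mixing described in the abstract."""
    theta = torch.rand(1).item() * 2 * math.pi
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return xyz @ rot.T, (rgb + 0.05 * torch.randn_like(rgb)).clamp(0.0, 1.0)

def msc_style_losses(encoder, color_head, normal_head, xyz, rgb, normal,
                     mask_ratio=0.4, tau=0.07):
    """Contrastive + reconstructive objective on one scene (xyz, rgb, normal: (n, 3))."""
    n = xyz.shape[0]
    # Cross masks: points masked in view 1 stay visible in view 2, and vice versa.
    mask1 = torch.rand(n) < mask_ratio
    mask2 = ~mask1

    feats = []
    for mask in (mask1, mask2):
        v_xyz, v_rgb = augment(xyz, rgb)
        v_rgb = v_rgb.clone()
        v_rgb[mask] = 0.0                      # hide color on masked points
        feats.append(encoder(torch.cat([v_xyz, v_rgb], dim=-1)))

    # Point-wise InfoNCE: the same point in the other view is the positive.
    z1 = F.normalize(feats[0], dim=-1)
    z2 = F.normalize(feats[1], dim=-1)
    loss_contrast = F.cross_entropy(z1 @ z2.T / tau, torch.arange(n))

    # Reconstruct point color and surfel normal at the points masked in view 1.
    loss_color = F.mse_loss(color_head(feats[0][mask1]), rgb[mask1])
    loss_normal = F.mse_loss(F.normalize(normal_head(feats[0][mask1]), dim=-1),
                             normal[mask1])
    return loss_contrast + loss_color + loss_normal

# Toy usage: a point-wise MLP stands in for the sparse-convolutional backbone.
encoder = torch.nn.Sequential(torch.nn.Linear(6, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))
color_head, normal_head = torch.nn.Linear(32, 3), torch.nn.Linear(32, 3)
xyz, rgb = torch.randn(2048, 3), torch.rand(2048, 3)
normal = F.normalize(torch.randn(2048, 3), dim=-1)
print(msc_style_losses(encoder, color_head, normal_head, xyz, rgb, normal))
```

The complementary masks mean every point contributes a color-visible and a color-hidden observation across the two views, so the contrastive and reconstructive signals come from a single pair of forward passes.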
Related papers
- Learning Robust 3D Representation from CLIP via Dual Denoising [4.230780744307392]
We propose Dual Denoising, a novel framework for learning robust and well-generalized 3D representations from CLIP.
It combines a denoising-based proxy task with a novel feature denoising network for 3D pre-training.
Experiments show that our model can effectively improve the representation learning performance and adversarial robustness of the 3D learning network.
arXiv Detail & Related papers (2024-07-01T02:15:03Z) - Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z) - RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering
Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z) - Leveraging Large-Scale Pretrained Vision Foundation Models for
Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z) - Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D
Action Representation Learning [33.68311764817763]
We propose Prompted Contrast with Masked Motion Modeling, PCM$^{\rm 3}$, for versatile 3D action representation learning.
Our method integrates the contrastive learning and masked prediction tasks in a mutually beneficial manner.
Experiments on five downstream tasks across three large-scale datasets demonstrate the superior generalization capacity of PCM$^{\rm 3}$ compared with state-of-the-art works.
arXiv Detail & Related papers (2023-08-08T01:27:55Z) - Generalized 3D Self-supervised Learning Framework via Prompted
Foreground-Aware Feature Contrast [38.34558139249363]
We propose a general foreground-aware feature contrast (FAC++) framework to learn more effective point cloud representations in pre-training.
We prevent over-discrimination between 3D segments/objects and encourage grouped foreground-to-background distinctions.
We show that our contrast pairs capture clear correspondences among foreground regions during pre-training.
arXiv Detail & Related papers (2023-03-11T11:42:01Z) - CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP [55.864132158596206]
Contrastive Language-Image Pre-training (CLIP) achieves promising results in 2D zero-shot and few-shot learning.
We make the first attempt to investigate how CLIP knowledge benefits 3D scene understanding.
We propose CLIP2Scene, a framework that transfers CLIP knowledge from 2D image-text pre-trained models to a 3D point cloud network.
arXiv Detail & Related papers (2023-01-12T10:42:39Z) - PointACL:Adversarial Contrastive Learning for Robust Point Clouds
Representation under Adversarial Attack [73.3371797787823]
Adversarial contrastive learning (ACL) is considered an effective way to improve the robustness of pre-trained models.
We present a robustness-aware loss function to train the self-supervised contrastive learning framework adversarially.
We validate our method, PointACL, on downstream tasks including 3D classification and 3D segmentation across multiple datasets.
arXiv Detail & Related papers (2022-09-14T22:58:31Z) - P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for
RGB-D Scene Understanding [24.93545970229774]
We propose contrasting "pairs of point-pixel pairs", where positives include pairs of RGB-D points in correspondence, and negatives include pairs where one of the two modalities has been disturbed.
This provides extra flexibility in making hard negatives and helps networks to learn features from both modalities, not just the more discriminating one of the two.
arXiv Detail & Related papers (2020-12-24T04:00:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.