Spatio-temporal Self-Supervised Representation Learning for 3D Point
Clouds
- URL: http://arxiv.org/abs/2109.00179v1
- Date: Wed, 1 Sep 2021 04:17:11 GMT
- Title: Spatio-temporal Self-Supervised Representation Learning for 3D Point
Clouds
- Authors: Siyuan Huang, Yichen Xie, Song-Chun Zhu, Yixin Zhu
- Abstract summary: We introduce a spatio-temporal representation learning (STRL) framework capable of learning from unlabeled 3D point clouds in a self-supervised fashion.
Inspired by how infants learn from visual data in the wild, we explore the rich spatio-temporal cues derived from the 3D data.
STRL takes two temporally-correlated frames from a 3D point cloud sequence as input, transforms them with spatial data augmentation, and learns an invariant representation in a self-supervised manner.
- Score: 96.9027094562957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To date, various 3D scene understanding tasks still lack practical and
generalizable pre-trained models, primarily due to the intricate nature of 3D
scene understanding tasks and their immense variations introduced by camera
views, lighting, occlusions, etc. In this paper, we tackle this challenge by
introducing a spatio-temporal representation learning (STRL) framework, capable
of learning from unlabeled 3D point clouds in a self-supervised fashion.
Inspired by how infants learn from visual data in the wild, we explore the rich
spatio-temporal cues derived from the 3D data. Specifically, STRL takes two
temporally-correlated frames from a 3D point cloud sequence as input,
transforms them with spatial data augmentation, and learns an invariant
representation in a self-supervised manner. To corroborate the efficacy of STRL, we
conduct extensive experiments on three types (synthetic, indoor, and outdoor)
of datasets. Experimental results demonstrate that, compared with supervised
learning methods, the learned self-supervised representation enables various
models to attain comparable or even better performance while allowing the
pre-trained models to generalize to downstream tasks, including 3D shape
classification, 3D object detection, and 3D semantic segmentation. Moreover,
the spatio-temporal contextual cues embedded in 3D point clouds significantly
improve the learned representations.
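A minimal sketch of the training objective the abstract describes (two temporally-correlated frames, spatial augmentation, and an invariance loss) is given below. It is not the authors' released implementation: the toy encoder, the augmentation choices (up-axis rotation and scaling), the BYOL-style online/target pair, and the momentum value are assumptions made for illustration.

```python
# Minimal sketch of an STRL-style objective. Assumptions (not from the paper's code):
# a toy per-point MLP encoder, up-axis rotation + scaling as "spatial augmentation",
# and a BYOL-style online/target pair with an EMA-updated target network.
import copy
import math
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

class PointEncoder(nn.Module):
    """Toy stand-in for the real point cloud backbone: per-point MLP + max pooling."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, pts):                      # pts: (B, N, 3)
        return self.mlp(pts).max(dim=1).values   # (B, dim) global feature

def spatial_augment(pts):
    """Illustrative spatial augmentation: random rotation about the up-axis plus scaling."""
    theta = random.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return (pts @ rot.T) * random.uniform(0.8, 1.2)

online = PointEncoder()
predictor = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
target = copy.deepcopy(online)                   # EMA copy, never updated by gradients
for p in target.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(list(online.parameters()) + list(predictor.parameters()), lr=1e-3)

def strl_step(frame_t, frame_t_plus_k, momentum=0.99):
    """One step on two temporally-correlated frames (each a (B, N, 3) tensor)."""
    v1, v2 = spatial_augment(frame_t), spatial_augment(frame_t_plus_k)
    pred = F.normalize(predictor(online(v1)), dim=-1)
    with torch.no_grad():
        tgt = F.normalize(target(v2), dim=-1)
    loss = (2.0 - 2.0 * (pred * tgt).sum(dim=-1)).mean()   # negative cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                                   # momentum (EMA) update of target
        for p_online, p_target in zip(online.parameters(), target.parameters()):
            p_target.mul_(momentum).add_(p_online, alpha=1.0 - momentum)
    return loss.item()
```

In this sketch the two inputs to strl_step would be nearby frames of the same depth or LiDAR sequence (or two augmented views of a static shape), mirroring the temporally-correlated pairing described in the abstract.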
Related papers
- Learning 3D Representations from Procedural 3D Programs [6.915871213703219]
Self-supervised learning has emerged as a promising approach for acquiring transferable 3D representations from unlabeled 3D point clouds.
We propose learning 3D representations from procedural 3D programs that automatically generate 3D shapes using simple primitives and augmentations.
arXiv Detail & Related papers (2024-11-25T18:59:57Z)
- GS-PT: Exploiting 3D Gaussian Splatting for Comprehensive Point Cloud Understanding via Self-supervised Learning [15.559369116540097]
Self-supervised learning on point clouds aims to leverage unlabeled 3D data to learn meaningful representations without reliance on manual annotations.
We propose GS-PT, which integrates 3D Gaussian Splatting (3DGS) into point cloud self-supervised learning for the first time.
Our pipeline utilizes transformers as the backbone for self-supervised pre-training and introduces novel contrastive learning tasks through 3DGS.
arXiv Detail & Related papers (2024-09-08T03:46:47Z)
- 4D Contrastive Superflows are Dense 3D Representation Learners [62.433137130087445]
We introduce SuperFlow, a novel framework designed to harness consecutive LiDAR-camera pairs for establishing pretraining objectives.
To further boost learning efficiency, we incorporate a plug-and-play view consistency module that enhances alignment of the knowledge distilled from camera views.
arXiv Detail & Related papers (2024-07-08T17:59:54Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation [67.56268991234371]
OV-Uni3DETR achieves state-of-the-art performance across various scenarios, surpassing existing methods by more than 6% on average.
Code and pre-trained models will be released later.
arXiv Detail & Related papers (2024-03-28T17:05:04Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting (a minimal voting sketch follows this list).
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
- 3D Object Detection with a Self-supervised Lidar Scene Flow Backbone [10.341296683155973]
We propose using a self-supervised training strategy to learn a general point cloud backbone model for downstream 3D vision tasks.
Our main contribution leverages learned flow and motion representations and combines a self-supervised backbone with a 3D detection head.
Experiments on KITTI and nuScenes benchmarks show that the proposed self-supervised pre-training increases 3D detection performance significantly.
arXiv Detail & Related papers (2022-05-02T07:53:29Z)
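The voting-based pseudo-label fusion mentioned in the "Leveraging Large-Scale Pretrained Vision Foundation Models" entry above can be sketched as follows. This is a hypothetical illustration, not that paper's code: it assumes the per-point class predictions of several 2D models have already been projected onto the point cloud, and the function name and array shapes are made up for the example.

```python
# Hypothetical sketch of voting-based fusion of per-point pseudo labels
# (assumes predictions from several 2D models are already projected to 3D points).
import numpy as np

def fuse_pseudo_labels(per_model_labels: np.ndarray, ignore_index: int = -1) -> np.ndarray:
    """per_model_labels: (num_models, num_points) integer class ids; ignore_index marks
    points a model could not label (e.g. not visible in its camera view).
    Returns (num_points,) fused labels by majority vote; ties resolve to the smallest class id."""
    num_models, num_points = per_model_labels.shape
    fused = np.full(num_points, ignore_index, dtype=np.int64)
    for i in range(num_points):
        votes = per_model_labels[:, i]
        votes = votes[votes != ignore_index]
        if votes.size == 0:
            continue                      # no model labeled this point
        classes, counts = np.unique(votes, return_counts=True)
        fused[i] = classes[np.argmax(counts)]
    return fused

# Example: three models vote on four points.
preds = np.array([[0, 2, 1, -1],
                  [0, 2, 2, -1],
                  [1, 2, 2,  3]])
print(fuse_pseudo_labels(preds))  # -> [0 2 2 3]
```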
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.