TREND: Unsupervised 3D Representation Learning via Temporal Forecasting for LiDAR Perception
- URL: http://arxiv.org/abs/2412.03054v1
- Date: Wed, 04 Dec 2024 06:17:24 GMT
- Title: TREND: Unsupervised 3D Representation Learning via Temporal Forecasting for LiDAR Perception
- Authors: Runjian Chen, Hyoungseob Park, Bo Zhang, Wenqi Shao, Ping Luo, Alex Wong
- Abstract summary: TREND is the first work on temporal forecasting for unsupervised 3D representation learning.
We evaluate TREND on downstream 3D object detection tasks on popular datasets, including NuScenes, Once and Waymo.
Experiment results show that TREND brings up to 90% more improvement than previous SOTA unsupervised 3D pre-training methods.
- Score: 39.3873954435857
- Abstract: Labeling LiDAR point clouds is notoriously time- and energy-consuming, which has spurred recent unsupervised 3D representation learning methods that alleviate the labeling burden in LiDAR perception via pretrained weights. Almost all existing work focuses on a single frame of LiDAR point cloud and neglects the temporal LiDAR sequence, which naturally accounts for object motion (and its semantics). Instead, we propose TREND, namely Temporal REndering with Neural fielD, to learn 3D representations via forecasting the future observation in an unsupervised manner. Unlike existing work that follows conventional contrastive learning or masked autoencoding paradigms, TREND integrates forecasting into 3D pre-training through a Recurrent Embedding scheme that generates 3D embeddings across time and a Temporal Neural Field that represents the 3D scene, through which we compute the loss using differentiable rendering. To the best of our knowledge, TREND is the first work on temporal forecasting for unsupervised 3D representation learning. We evaluate TREND on downstream 3D object detection tasks on popular datasets, including NuScenes, Once and Waymo. Experiment results show that TREND brings up to 90% more improvement than previous SOTA unsupervised 3D pre-training methods and generally improves different downstream models across datasets, demonstrating that temporal forecasting indeed improves LiDAR perception. Code and models will be released.
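The abstract describes the pipeline only at a high level, so the following is a minimal, hypothetical sketch of the forecast-then-render idea: a recurrent update rolls a 3D feature grid forward in time, a temporal neural field maps queried features to density, and differentiable rendering yields a future depth map to compare against the actual next scan. All module names (RecurrentEmbedding, TemporalNeuralField, render_depth), shapes, and the convolutional update rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of TREND's forecast-and-render objective; module names,
# shapes, and update rules are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class RecurrentEmbedding(nn.Module):
    """Rolls a voxel feature grid one step forward in time (assumed conv update)."""
    def __init__(self, c: int):
        super().__init__()
        self.update = nn.Conv3d(2 * c, c, kernel_size=3, padding=1)

    def forward(self, feat_t: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # feat_t, hidden: (B, c, X, Y, Z) -> next hidden state (B, c, X, Y, Z)
        return torch.tanh(self.update(torch.cat([feat_t, hidden], dim=1)))

class TemporalNeuralField(nn.Module):
    """Maps a query (x, y, z, t) plus interpolated scene features to density."""
    def __init__(self, c: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c + 4, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, xyzt: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        # xyzt: (R, S, 4), feat: (R, S, c) -> non-negative density (R, S)
        return torch.relu(self.mlp(torch.cat([feat, xyzt], dim=-1))).squeeze(-1)

def render_depth(density: torch.Tensor, z_vals: torch.Tensor) -> torch.Tensor:
    """Differentiable volume rendering of expected depth along each LiDAR ray.
    density, z_vals: (R, S) per-ray sample densities and depths."""
    delta = z_vals[:, 1:] - z_vals[:, :-1]
    delta = torch.cat([delta, delta[:, -1:]], dim=1)
    alpha = 1.0 - torch.exp(-density * delta)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
        dim=1)[:, :-1]
    return ((alpha * trans) * z_vals).sum(dim=1)  # expected depth per ray
```

A pre-training step would then encode scan t into a feature grid with the 3D backbone, roll it forward with RecurrentEmbedding, query TemporalNeuralField along rays cast from the future sensor pose, and minimize, for example, an L1 loss between render_depth(...) and the measured depths of scan t+1; the pre-trained backbone finally initializes the downstream detector.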
Related papers
- Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection [52.66283064389691]
State-of-the-art 3D object detectors are often trained on massive labeled datasets.
Recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels.
We propose a shelf-supervised approach for generating zero-shot 3D bounding boxes from paired RGB and LiDAR data.
arXiv Detail & Related papers (2024-06-14T15:21:57Z) - OccFlowNet: Towards Self-supervised Occupancy Estimation via
Differentiable Rendering and Occupancy Flow [0.6577148087211809]
We present a novel approach to occupancy estimation, inspired by neural radiance fields (NeRF), that uses only 2D labels.
We employ differentiable volumetric rendering to predict depth and semantic maps and train a 3D network based on 2D supervision only.
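A minimal sketch of this 2D-only supervision is given below; the ray-sampling layout and the composite_semantics/loss_2d helpers are assumptions, not the paper's code.

```python
# Sketch of training a 3D occupancy network from 2D labels via differentiable
# rendering; sampling layout and losses are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def composite_semantics(occ, sem, z_vals):
    """occ: (R, S) occupancy probabilities at ray samples; sem: (R, S, C)
    per-sample class logits; z_vals: (R, S) sample depths."""
    trans = torch.cumprod(
        torch.cat([torch.ones_like(occ[:, :1]), 1.0 - occ + 1e-10], dim=1),
        dim=1)[:, :-1]
    w = occ * trans                                # compositing weights (R, S)
    depth = (w * z_vals).sum(dim=1)                # rendered depth (R,)
    sem2d = (w.unsqueeze(-1) * sem).sum(dim=1)     # rendered class logits (R, C)
    return depth, sem2d

def loss_2d(depth, sem2d, depth_gt, label_gt):
    # Only 2D targets are touched: a projected depth map and a 2D label map.
    return F.l1_loss(depth, depth_gt) + F.cross_entropy(sem2d, label_gt)
```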
arXiv Detail & Related papers (2024-02-20T08:04:12Z) - OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments [77.0399450848749]
We propose OccNeRF, a method for training occupancy networks without 3D supervision.
We parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range.
For semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model.
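A finite grid cannot directly cover a camera's unbounded range; one standard fix, shown below as an assumption (the paper's exact parameterization may differ), is a mip-NeRF 360-style scene contraction that maps all of R^3 into a bounded ball.

```python
# One standard unbounded-scene parameterization (an assumption; OccNeRF's exact
# scheme may differ): contract all of R^3 into a ball of radius 2 * inner so a
# finite occupancy grid can represent arbitrarily distant geometry.
import torch

def contract(x: torch.Tensor, inner: float = 1.0) -> torch.Tensor:
    """Points with ||x|| <= inner are kept as-is; points outside map into the
    shell inner < ||contract(x)|| < 2 * inner (mip-NeRF 360-style)."""
    n = x.norm(dim=-1, keepdim=True).clamp_min(1e-8) / inner
    return torch.where(n <= 1.0, x, (2.0 - 1.0 / n) * (x / n))
```

Sampling uniformly in the contracted space then places progressively sparser samples at larger distances, one way to align a fixed sampling budget with an effectively infinite perceptive range.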
arXiv Detail & Related papers (2023-09-19T11:13:01Z)
- SPOT: Scalable 3D Pre-training via Occupancy Prediction for Learning Transferable 3D Representations [76.45009891152178]
The pretraining-finetuning approach can alleviate the labeling burden by fine-tuning a pre-trained backbone on various downstream datasets and tasks.
We show, for the first time, that general representation learning can be achieved through the task of occupancy prediction.
Our findings will facilitate the understanding of LiDAR points and pave the way for future advancements in LiDAR pre-training.
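As a deliberately simplified illustration of occupancy prediction as a pretext task (SPOT's actual targets and losses are richer than this), a scan can be voxelized into a binary occupancy grid for the backbone to predict.

```python
# Simplified occupancy pretext target (a generic variant; SPOT's real targets
# are richer): voxelize one LiDAR scan into a binary occupancy grid.
import numpy as np

def occupancy_target(points, pc_range, voxel_size):
    """points: (N, 3); pc_range: (xmin, ymin, zmin, xmax, ymax, zmax);
    voxel_size: (vx, vy, vz). Returns a binary occupancy grid."""
    lo, hi = np.asarray(pc_range[:3]), np.asarray(pc_range[3:])
    vs = np.asarray(voxel_size)
    dims = np.floor((hi - lo) / vs).astype(int)
    idx = np.floor((points - lo) / vs).astype(int)     # voxel index per point
    keep = np.all((idx >= 0) & (idx < dims), axis=1)   # drop out-of-range points
    grid = np.zeros(dims, dtype=np.float32)
    grid[tuple(idx[keep].T)] = 1.0
    return grid  # e.g. supervise backbone logits with BCE against this grid
```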
arXiv Detail & Related papers (2022-05-02T07:53:29Z)
- 3D Object Detection with a Self-supervised Lidar Scene Flow Backbone [10.341296683155973]
We propose using a self-supervised training strategy to learn a general point cloud backbone model for downstream 3D vision tasks.
Our main contribution leverages learned flow and motion representations and combines a self-supervised backbone with a 3D detection head.
Experiments on KITTI and nuScenes benchmarks show that the proposed self-supervised pre-training increases 3D detection performance significantly.
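A typical self-supervised flow objective of this kind (a common recipe, not necessarily the paper's exact losses) warps frame t with the predicted flow and penalizes the Chamfer distance to frame t+1, requiring no flow labels.

```python
# Common self-supervised scene flow loss (illustrative; the paper's exact
# losses may differ): warp frame t and compare to frame t+1 without labels.
import torch

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """a: (N, 3), b: (M, 3); symmetric nearest-neighbor distance."""
    d = torch.cdist(a, b)                        # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def self_supervised_flow_loss(p_t, flow_pred, p_t1):
    # Warp frame t by the predicted per-point flow and compare to frame t+1.
    return chamfer(p_t + flow_pred, p_t1)
```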
arXiv Detail & Related papers (2021-09-28T19:58:13Z)
- Self-supervised Point Cloud Prediction Using 3D Spatio-temporal Convolutional Networks [27.49539859498477]
Exploiting past 3D LiDAR scans to predict future point clouds is a promising method for autonomous mobile systems.
We propose an end-to-end approach that exploits a 2D range image representation of each 3D LiDAR scan.
We develop an encoder-decoder architecture using 3D convolutions to jointly aggregate spatial and temporal information of the scene.
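The range image itself is the standard spherical projection; a minimal sketch follows, where the vertical field-of-view bounds are sensor-specific assumptions.

```python
# Standard spherical (range image) projection of a LiDAR scan; the fov bounds
# are sensor-specific assumptions.
import numpy as np

def to_range_image(points, h=64, w=2048, fov_up=3.0, fov_down=-25.0):
    """points: (N, 3) LiDAR xyz. Returns an (h, w) range image; empty pixels stay 0."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                                   # azimuth, [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))
    up, down = np.radians(fov_up), np.radians(fov_down)
    u = (0.5 * (1.0 - yaw / np.pi) * w).astype(int) % w      # column from azimuth
    v = ((up - pitch) / (up - down) * h).astype(int)         # row from elevation
    img = np.zeros((h, w), dtype=np.float32)
    ok = (v >= 0) & (v < h)                                  # inside vertical fov
    img[v[ok], u[ok]] = r[ok]
    return img
```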
arXiv Detail & Related papers (2021-09-16T13:01:13Z)
- Lifting 2D Object Locations to 3D by Discounting LiDAR Outliers across Objects and Views [70.1586005070678]
We present a system for automatically converting 2D mask object predictions and raw LiDAR point clouds into full 3D bounding boxes of objects.
Our method significantly outperforms previous work, even though those methods use more complex pipelines, 3D models, and additional human-annotated external sources of prior information.
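As a rough illustration of the lifting step (a generic heuristic, not the paper's pipeline, which discounts outliers across objects and views far more carefully), one can project LiDAR points into the image, keep those inside a predicted 2D mask, trim obvious outliers, and fit a box; the 3.0 m trim threshold below is an arbitrary assumption.

```python
# Generic mask-to-box lifting heuristic (illustrative only, not the paper's
# pipeline); the outlier-trim threshold is an arbitrary assumption.
import numpy as np

def lift_mask_to_box(points, K, T_cam_lidar, mask):
    """points: (N, 3) LiDAR xyz; K: (3, 3) intrinsics; T_cam_lidar: (4, 4)
    LiDAR-to-camera transform; mask: (H, W) bool instance mask."""
    h, w = mask.shape
    pts_h = np.c_[points, np.ones(len(points))]            # homogeneous (N, 4)
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]                 # camera-frame points
    uvw = (K @ cam.T).T
    zc = np.clip(uvw[:, 2:3], 1e-6, None)                  # avoid divide-by-zero
    uv = uvw[:, :2] / zc
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (cam[:, 2] > 0.1) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    valid[valid] &= mask[v[valid], u[valid]]               # inside the 2D mask
    obj = points[valid]
    if len(obj) < 5:
        return None                                        # too few LiDAR hits
    med = np.median(obj, axis=0)                           # crude outlier trim
    obj = obj[np.linalg.norm(obj - med, axis=1) < 3.0]
    lo, hi = obj.min(axis=0), obj.max(axis=0)
    return np.r_[(lo + hi) / 2, hi - lo]                   # cx,cy,cz,dx,dy,dz
```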
arXiv Detail & Related papers (2021-09-16T13:01:13Z)
- Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds [96.9027094562957]
We introduce a spatio-temporal representation learning (STRL) framework, capable of learning from unlabeled 3D point clouds in a self-supervised fashion.
Inspired by how infants learn from visual data in the wild, we explore rich spatio-temporal cues derived from the 3D data.
STRL takes two temporally related frames from a 3D point cloud sequence as input, transforms them with spatial data augmentation, and learns an invariant representation in a self-supervised manner.
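The invariance objective can be sketched as a BYOL-style online/target loss, which STRL's design resembles; the encoder, projector, and predictor modules named below are placeholders.

```python
# BYOL-style invariance loss between two temporally related, augmented frames
# (a sketch of the kind of objective STRL uses; module names are placeholders).
import torch
import torch.nn.functional as F

def invariance_loss(online_pred: torch.Tensor, target_proj: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity between the online branch's prediction for one
    augmented frame and the stop-gradient target projection of the other."""
    return -F.cosine_similarity(online_pred, target_proj.detach(), dim=-1).mean()

# One step, schematically: z1 = predictor(projector(encoder(aug(frame_t))));
# z2 = target_projector(target_encoder(aug(frame_t_plus_k)));
# loss = invariance_loss(z1, z2) (+ symmetric term); the target network tracks
# an exponential moving average of the online network.
```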
arXiv Detail & Related papers (2021-09-01T04:17:11Z)
- 3DMotion-Net: Learning Continuous Flow Function for 3D Motion Prediction [12.323767993152968]
We address the problem of predicting the future 3D motion of 3D object scans from the previous two consecutive frames.
We propose a self-supervised approach that leverages the power of the deep neural network to learn a continuous flow function of 3D point clouds.
We perform extensive experiments on the D-FAUST, SCAPE and TOSCA benchmark datasets, and the results demonstrate that our approach is capable of handling temporally inconsistent input.
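The core idea of a continuous flow function can be sketched as an MLP that maps any 3D query point, conditioned on a code from the two input frames, to a displacement, so motion can be evaluated at arbitrary (temporally inconsistent) samples; the layer sizes and conditioning below are illustrative assumptions.

```python
# Continuous flow function sketch (illustrative; the paper conditions on
# learned shape features): an MLP maps any 3D point to a displacement.
import torch
import torch.nn as nn

class FlowFunction(nn.Module):
    def __init__(self, cond_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 3),              # per-point displacement
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        """x: (N, 3) query points; cond: (cond_dim,) code from the two frames."""
        c = cond.expand(x.shape[0], -1)
        return self.mlp(torch.cat([x, c], dim=-1))

# Forecast, schematically: next_frame = frame_t + flow_fn(frame_t, code), where
# code is produced by an (assumed) encoder of the two previous frames.
```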
arXiv Detail & Related papers (2020-06-24T17:39:19Z)