Sonata: Self-Supervised Learning of Reliable Point Representations
- URL: http://arxiv.org/abs/2503.16429v1
- Date: Thu, 20 Mar 2025 17:59:59 GMT
- Title: Sonata: Self-Supervised Learning of Reliable Point Representations
- Authors: Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard Newcombe, Hengshuang Zhao, Julian Straub
- Abstract summary: We find that existing 3D self-supervised learning approaches fall short when evaluated on representation quality through linear probing. This challenge is unique to 3D and arises from the sparse nature of point cloud data. We address it through two key strategies: obscuring spatial information and enhancing the reliance on input features.
- Score: 29.931666371580178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we question whether we have a reliable self-supervised point cloud model that can be used for diverse 3D tasks via simple linear probing, even with limited data and minimal computation. We find that existing 3D self-supervised learning approaches fall short when evaluated on representation quality through linear probing. We hypothesize that this is due to what we term the "geometric shortcut", which causes representations to collapse to low-level spatial features. This challenge is unique to 3D and arises from the sparse nature of point cloud data. We address it through two key strategies: obscuring spatial information and enhancing the reliance on input features, ultimately composing a Sonata of 140k point clouds through self-distillation. Sonata is simple and intuitive, yet its learned representations are strong and reliable: zero-shot visualizations demonstrate semantic grouping, alongside strong spatial reasoning through nearest-neighbor relationships. Sonata demonstrates exceptional parameter and data efficiency, tripling linear probing accuracy (from 21.8% to 72.5%) on ScanNet and nearly doubling performance with only 1% of the data compared to previous approaches. Full fine-tuning further advances SOTA across both 3D indoor and outdoor perception tasks.
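Linear probing, the evaluation the abstract leans on, freezes the pre-trained encoder and fits only a single linear classifier on its per-point features, so accuracy reflects the quality of the frozen representation rather than fine-tuning capacity. Below is a minimal sketch of that protocol in PyTorch; the `encoder` interface and the data loader are placeholders, not Sonata's actual API.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, num_classes, feat_dim, epochs=20, lr=1e-3):
    """Freeze the pre-trained encoder and train only a linear classifier.

    `encoder` is assumed to map a point cloud to (N, feat_dim) per-point
    embeddings; the exact interface of Sonata's encoder may differ.
    """
    encoder.eval()                            # frozen backbone
    for p in encoder.parameters():
        p.requires_grad_(False)

    head = nn.Linear(feat_dim, num_classes)   # the only trainable module
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for points, labels in train_loader:   # labels: (N,) per-point classes
            with torch.no_grad():
                feats = encoder(points)       # (N, feat_dim), no gradients
            loss = ce(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```

The abstract's jump from 21.8% to 72.5% linear probing accuracy on ScanNet is measured under a frozen-backbone protocol of this kind.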
Related papers
- Clustering based Point Cloud Representation Learning for 3D Analysis [80.88995099442374]
We propose a clustering based supervised learning scheme for point cloud analysis.
Unlike the current de facto scene-wise training paradigm, our algorithm conducts within-class clustering in the point embedding space.
Our algorithm shows notable improvements on widely used point cloud segmentation datasets.
arXiv Detail & Related papers (2023-07-27T03:42:12Z)
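As a rough illustration of the within-class clustering idea above, one can partition each class's point embeddings into a few sub-class centroids with k-means and use the nearest same-class centroid as a finer-grained training target. The sketch below is a hypothetical rendering of that step, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def within_class_centroids(embeddings, labels, clusters_per_class=4):
    """Cluster point embeddings separately inside each semantic class.

    embeddings: (N, D) per-point features, labels: (N,) class ids.
    Returns a dict mapping class id -> (k, D) sub-class centroids.
    """
    centroids = {}
    for c in np.unique(labels):
        feats_c = embeddings[labels == c]
        k = min(clusters_per_class, len(feats_c))        # never ask for more clusters than points
        km = KMeans(n_clusters=k, n_init=10).fit(feats_c)
        centroids[c] = km.cluster_centers_
    return centroids
```

Each point can then be supervised to stay close to the nearest centroid of its own class, a finer target than a single scene-wise class label.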
- Learning Signed Distance Functions from Noisy 3D Point Clouds via Noise to Noise Mapping [52.25114448281418]
Learning signed distance functions (SDFs) from 3D point clouds is an important task in 3D computer vision.
We propose to learn SDFs via a noise to noise mapping, which does not require any clean point cloud or ground truth supervision for training.
Our novelty lies in the noise to noise mapping, which can infer a highly accurate SDF of a single object or scene from multiple, or even a single, noisy point cloud observation.
arXiv Detail & Related papers (2023-06-02T09:52:04Z)
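An SDF network of the kind learned here is, at query time, just a network f(x) returning a signed distance for any 3D position; a noisy point can be pulled onto the predicted surface by stepping along the negative gradient. The sketch below shows that generic query-and-project step; the paper's noise-to-noise training loss itself is not reproduced here.

```python
import torch
import torch.nn as nn

class SDFNet(nn.Module):
    """Tiny MLP mapping a 3D query point to a signed distance (illustrative)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):                    # x: (N, 3)
        return self.net(x).squeeze(-1)       # (N,) signed distances

def project_to_surface(sdf, points):
    """Move points onto the zero level set: x <- x - f(x) * grad f(x) / |grad f(x)|."""
    points = points.clone().requires_grad_(True)
    d = sdf(points)                                   # (N,) predicted distances
    grad = torch.autograd.grad(d.sum(), points)[0]    # (N, 3) spatial gradient
    normal = grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
    return (points - d.unsqueeze(-1) * normal).detach()
```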
- Unsupervised Inference of Signed Distance Functions from Single Sparse Point Clouds without Learning Priors [54.966603013209685]
It is vital to infer signed distance functions (SDFs) from 3D point clouds.
We present a neural network to directly infer SDFs from single sparse point clouds.
arXiv Detail & Related papers (2023-03-25T15:56:50Z)
- Efficient Implicit Neural Reconstruction Using LiDAR [6.516471975863534]
We propose a new method that uses sparse LiDAR point clouds and rough odometry to reconstruct a fine-grained implicit occupancy field efficiently within a few minutes.
As far as we know, our method is the first to reconstruct an implicit scene representation from LiDAR-only input.
arXiv Detail & Related papers (2023-02-28T07:31:48Z)
- ALSO: Automotive Lidar Self-supervision by Occupancy estimation [70.70557577874155]
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds.
The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled.
The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information.
arXiv Detail & Related papers (2022-12-12T13:10:19Z)
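Concretely, the pretext task can be pictured as an occupancy head that labels sampled 3D query positions as surface/occupied or free space, trained with binary cross-entropy; free-space labels can be derived cheaply, for example from positions along the sensor rays. The sketch below is an illustrative simplification with a hypothetical pooled `scene_feat`, not ALSO's actual decoder.

```python
import torch
import torch.nn as nn

class OccupancyHead(nn.Module):
    """Predict occupancy of 3D query points from a scene feature (illustrative)."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, scene_feat, queries):
        # scene_feat: (B, feat_dim) pooled backbone feature, queries: (B, Q, 3)
        f = scene_feat.unsqueeze(1).expand(-1, queries.shape[1], -1)
        return self.mlp(torch.cat([f, queries], dim=-1)).squeeze(-1)   # (B, Q) logits

def pretext_loss(head, scene_feat, queries, occupancy_gt):
    """Binary cross-entropy on sampled query points (occupied vs. free space)."""
    logits = head(scene_feat, queries)
    return nn.functional.binary_cross_entropy_with_logits(logits, occupancy_gt)
```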
- Efficient Urban-scale Point Clouds Segmentation with BEV Projection [0.0]
Most deep point cloud models conduct learning directly on 3D point clouds.
We propose to transfer the 3D point clouds to a dense bird's-eye-view projection.
arXiv Detail & Related papers (2021-09-19T06:49:59Z)
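The bird's-eye-view transfer amounts to scattering points into a 2D grid indexed by their ground-plane coordinates, after which standard 2D segmentation networks apply and per-point labels are read back from each point's cell. A minimal sketch, with illustrative grid size and channels:

```python
import numpy as np

def points_to_bev(points, grid_size=512, cell=0.5):
    """Scatter (N, 3) points into a (grid_size, grid_size, 2) BEV image.

    Channels here are just max height and point density; real pipelines
    typically add intensity and further height statistics.
    """
    bev = np.zeros((grid_size, grid_size, 2), dtype=np.float32)
    ix = np.clip((points[:, 0] / cell + grid_size / 2).astype(int), 0, grid_size - 1)
    iy = np.clip((points[:, 1] / cell + grid_size / 2).astype(int), 0, grid_size - 1)
    for x, y, z in zip(ix, iy, points[:, 2]):
        bev[y, x, 0] = max(bev[y, x, 0], z)   # max height per cell
        bev[y, x, 1] += 1.0                   # point count per cell
    return bev
```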
- Hidden Footprints: Learning Contextual Walkability from 3D Human Trails [70.01257397390361]
Current datasets only tell you where people are, not where they could be.
We first augment the set of valid, labeled walkable regions by propagating person observations between images, utilizing 3D information to create what we call hidden footprints.
We devise a training strategy designed for such sparse labels, combining a class-balanced classification loss with a contextual adversarial loss.
arXiv Detail & Related papers (2020-08-19T23:19:08Z)
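The class-balanced classification loss referred to above re-weights the sparse positive (walkable) labels so they are not drowned out by the abundant negatives; one common way to write such a loss is sketched below, as an illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def class_balanced_bce(logits, targets, valid_mask):
    """Weight positives and negatives inversely to their frequency.

    logits, targets, valid_mask share the same spatial shape; only pixels
    with valid_mask == 1 carry a (sparse) walkability label.
    """
    pos = ((targets == 1) & (valid_mask == 1)).float()
    neg = ((targets == 0) & (valid_mask == 1)).float()
    n_pos, n_neg = pos.sum().clamp(min=1), neg.sum().clamp(min=1)
    weights = pos / n_pos + neg / n_neg          # each class contributes equally overall
    loss = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    return (loss * weights).sum()
```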
- 3D Point Cloud Feature Explanations Using Gradient-Based Methods [11.355723874379317]
We extend saliency methods that have been shown to work on image data to 3D data.
Driven by the insight that 3D data is inherently sparse, we visualise the features learnt by a voxel-based classification network.
Our results show that the Voxception-ResNet model can be pruned down to 5% of its parameters with negligible loss in accuracy.
arXiv Detail & Related papers (2020-06-09T23:17:24Z)
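Gradient-based saliency carries over to 3D almost verbatim: back-propagate a class score to the input voxel grid and read the gradient magnitude as per-voxel importance. A minimal sketch, with the classifier and class index as placeholders:

```python
import torch

def voxel_saliency(model, voxels, class_idx):
    """Vanilla gradient saliency for a voxel-based classifier.

    voxels: (1, 1, D, H, W) occupancy grid; returns a (D, H, W) importance map.
    """
    voxels = voxels.clone().requires_grad_(True)
    score = model(voxels)[0, class_idx]      # scalar score for the class of interest
    score.backward()
    return voxels.grad.abs().squeeze()       # per-voxel gradient magnitude
```

Because most voxels are empty, the resulting saliency concentrates on the occupied cells, which is the sparsity observation the blurb above alludes to.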
- A Nearest Neighbor Network to Extract Digital Terrain Models from 3D Point Clouds [1.6249267147413524]
We present an algorithm that operates on 3D point clouds and estimates the underlying DTM for the scene using an end-to-end approach.
Our model learns neighborhood information and seamlessly integrates this with point-wise and block-wise global features.
arXiv Detail & Related papers (2020-05-21T15:54:55Z)
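The neighborhood information mentioned above can be gathered explicitly with a k-nearest-neighbor lookup and pooled alongside point-wise and global features; the sketch below shows that aggregation pattern in a generic form, not the paper's exact architecture.

```python
import torch

def knn_gather(points, feats, k=16):
    """For every point, gather the features of its k nearest neighbors.

    points: (N, 3), feats: (N, D) -> (N, k, D). The point itself is included
    among its neighbors. Brute-force cdist is fine for small clouds; real
    systems use spatial indices instead.
    """
    dists = torch.cdist(points, points)                 # (N, N) pairwise distances
    idx = dists.topk(k, largest=False).indices          # (N, k) neighbor indices
    return feats[idx]                                   # (N, k, D)

def aggregate(points, feats, k=16):
    """Concatenate point-wise, neighborhood-pooled, and global features."""
    neigh = knn_gather(points, feats, k).max(dim=1).values          # (N, D) local context
    global_feat = feats.max(dim=0, keepdim=True).values.expand_as(feats)
    return torch.cat([feats, neigh, global_feat], dim=-1)           # (N, 3D)
```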
- Label-Efficient Learning on Point Clouds using Approximate Convex Decompositions [43.1279121348315]
We investigate the use of Approximate Convex Decompositions (ACD) as a self-supervisory signal for label-efficient learning of point cloud representations.
We show that using ACD to approximate ground truth segmentation provides excellent self-supervision for learning 3D point cloud representations.
arXiv Detail & Related papers (2020-03-30T21:44:43Z)
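One plausible way to turn an approximate convex decomposition into a self-supervisory signal is a pairwise contrastive loss over whether two points fall in the same convex component, which needs no human labels; the sketch below illustrates that idea and is not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def acd_pairwise_loss(embeddings, component_ids, temperature=0.1):
    """Contrastive loss driven by ACD component membership (illustrative).

    embeddings: (N, D) per-point features; component_ids: (N,) index of the
    approximate convex component each point belongs to, computed offline.
    Points in the same component are pulled together, others pushed apart.
    """
    z = F.normalize(embeddings, dim=-1)
    logits = z @ z.t() / temperature                      # (N, N) cosine similarities
    log_prob = F.log_softmax(logits, dim=-1)
    same = (component_ids[:, None] == component_ids[None, :]).float()
    same.fill_diagonal_(0)                                # exclude self-pairs as positives
    n_pos = same.sum(dim=-1).clamp(min=1)
    return -(log_prob * same).sum(dim=-1).div(n_pos).mean()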
- D3Feat: Joint Learning of Dense Detection and Description of 3D Local Features [51.04841465193678]
We leverage a 3D fully convolutional network for 3D point clouds.
We propose a novel and practical learning mechanism that densely predicts both a detection score and a description feature for each 3D point.
Our method achieves state-of-the-art results in both indoor and outdoor scenarios.
arXiv Detail & Related papers (2020-03-06T12:51:09Z)
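Dense joint detection and description reduces to two heads on top of per-point backbone features: an L2-normalized descriptor and a scalar keypoint score, with keypoints taken as the highest-scoring points. The sketch below shows that output structure; the descriptor size and score definition are illustrative rather than D3Feat's exact saliency formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectDescribeHead(nn.Module):
    """Per-point descriptor + detection score on top of backbone features."""
    def __init__(self, feat_dim, desc_dim=32):
        super().__init__()
        self.desc = nn.Linear(feat_dim, desc_dim)
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, feats):                               # feats: (N, feat_dim)
        desc = F.normalize(self.desc(feats), dim=-1)        # (N, desc_dim) unit descriptors
        score = torch.sigmoid(self.score(feats)).squeeze(-1)  # (N,) keypoint scores
        return desc, score

def top_keypoints(desc, score, k=256):
    """Select the k highest-scoring points and their descriptors."""
    idx = score.topk(min(k, score.numel())).indices
    return idx, desc[idx]
```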
This list is automatically generated from the titles and abstracts of the papers on this site.