CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning
- URL: http://arxiv.org/abs/2412.03059v2
- Date: Mon, 10 Mar 2025 03:54:25 GMT
- Title: CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning
- Authors: Runjian Chen, Hang Zhang, Avinash Ravichandran, Hyoungseob Park, Wenqi Shao, Alex Wong, Ping Luo,
- Abstract summary: We propose a joint unsupervised differentiable-rendering-based pre-training method for images and point clouds, termed CLAP. Our method overcomes the computational hurdle by Curvature Sampling to select the more informative points/pixels for pre-training. Experiments show that CLAP achieves up to 100% more performance gain as compared to previous SOTA pre-training methods.
- Score: 43.7594705101778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised 3D representation learning reduces the burden of labeling multimodal 3D data for fusion perception tasks. Among different pre-training paradigms, differentiable-rendering-based methods have shown the most promise. However, existing works conduct pre-training separately for each modality due to the computational cost of processing large point clouds together with images. As such, the mutual benefit of high-level semantics (from images) and 3D structure (from point clouds) has not been exploited. To address this gap, we propose a joint unsupervised differentiable-rendering-based pre-training method for images and point clouds, termed CLAP, short for Curvature sampLing and leArnable Prototype. Specifically, our method overcomes the computational hurdle by using Curvature Sampling to select the more informative points/pixels for pre-training. To uncover the performance benefits brought by their complementarity, we propose to use learnable prototypes to represent parts of the 3D scenes in a common feature space, together with an Expectation-Maximization training scheme to associate embeddings of each modality with prototypes. We further propose a swapping prediction loss that explores their interplay through prototypes, along with a Gram Matrix Regularization term to maintain training stability. Experiments on the NuScenes and Waymo datasets show that CLAP achieves up to 100% more performance gain compared to previous SOTA pre-training methods. Code and models will be released.
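To make the two ingredients named in the abstract more concrete, the sketch below illustrates one plausible reading of curvature-based point sampling (surface variation of a local k-NN covariance) and of a SwAV-style swapped prediction loss over shared learnable prototypes. The function names, the curvature proxy, and the softmax-based assignment step are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of two components named in the CLAP abstract:
# (1) curvature-based point sampling and (2) a swapped prediction loss
# between image and point-cloud embeddings via shared learnable prototypes.
import torch
import torch.nn.functional as F


def curvature_scores(points: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Estimate per-point curvature as the surface variation
    lambda_min / (lambda_0 + lambda_1 + lambda_2) of the local k-NN
    covariance (a common proxy; the paper may define curvature differently)."""
    dists = torch.cdist(points, points)                         # (N, N)
    knn_idx = dists.topk(k + 1, largest=False).indices[:, 1:]   # drop self
    neighbors = points[knn_idx]                                 # (N, k, 3)
    centered = neighbors - neighbors.mean(dim=1, keepdim=True)
    cov = centered.transpose(1, 2) @ centered / k               # (N, 3, 3)
    eigvals = torch.linalg.eigvalsh(cov)                        # ascending
    return eigvals[:, 0] / (eigvals.sum(dim=1) + 1e-8)


def sample_by_curvature(points: torch.Tensor, budget: int) -> torch.Tensor:
    """Keep the `budget` highest-curvature (most informative) points."""
    idx = curvature_scores(points).topk(budget).indices
    return points[idx]


def swapped_prediction_loss(z_img: torch.Tensor,
                            z_pts: torch.Tensor,
                            prototypes: torch.Tensor,
                            temperature: float = 0.1) -> torch.Tensor:
    """SwAV-style loss: the prototype assignment of one modality supervises
    the prediction of the other, and vice versa."""
    z_img, z_pts = F.normalize(z_img, dim=-1), F.normalize(z_pts, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    logits_img, logits_pts = z_img @ protos.T, z_pts @ protos.T   # (B, K)
    with torch.no_grad():  # soft assignment targets; the paper uses an EM scheme
        q_img = F.softmax(logits_img / temperature, dim=-1)
        q_pts = F.softmax(logits_pts / temperature, dim=-1)
    loss_img = -(q_pts * F.log_softmax(logits_img / temperature, dim=-1)).sum(-1)
    loss_pts = -(q_img * F.log_softmax(logits_pts / temperature, dim=-1)).sum(-1)
    return (loss_img + loss_pts).mean() / 2
```

The sketch omits the Expectation-Maximization assignment schedule and the Gram Matrix Regularization term that the abstract credits with stabilizing training.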
Related papers
- CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting [88.24743308058441]
We present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS.
We develop an efficient way to generate triplets of 3DGS, images, and text, facilitating CLIP-GS in learning unified multimodal representations.
arXiv Detail & Related papers (2024-12-26T09:54:25Z) - A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision [65.33043028101471]
We introduce a diffusion model for Gaussian Splats, SplatDiffusion, to enable generation of three-dimensional structures from single images. Existing methods rely on deterministic, feed-forward predictions, which limit their ability to handle the inherent ambiguity of 3D inference from 2D data.
arXiv Detail & Related papers (2024-12-01T00:29:57Z) - GS-PT: Exploiting 3D Gaussian Splatting for Comprehensive Point Cloud Understanding via Self-supervised Learning [15.559369116540097]
Self-supervised learning of point clouds aims to leverage unlabeled 3D data to learn meaningful representations without reliance on manual annotations.
We propose GS-PT, which integrates 3D Gaussian Splatting (3DGS) into point cloud self-supervised learning for the first time.
Our pipeline utilizes transformers as the backbone for self-supervised pre-training and introduces novel contrastive learning tasks through 3DGS.
arXiv Detail & Related papers (2024-09-08T03:46:47Z) - Personalized 3D Human Pose and Shape Refinement [19.082329060985455]
Regression-based methods have dominated the field of 3D human pose and shape estimation.
We propose to construct dense correspondences between initial human model estimates and the corresponding images.
We show that our approach not only consistently leads to better image-model alignment, but also to improved 3D accuracy.
arXiv Detail & Related papers (2024-03-18T10:13:53Z) - PointSeg: A Training-Free Paradigm for 3D Scene Segmentation via Foundation Models [51.24979014650188]
We present PointSeg, a training-free paradigm that leverages off-the-shelf vision foundation models to address 3D scene perception tasks.
PointSeg can segment anything in a 3D scene by acquiring accurate 3D prompts and aligning their corresponding pixels across frames.
Our approach significantly surpasses the state-of-the-art specialist training-free model by 14.1%, 12.3%, and 12.6% mAP on the ScanNet, ScanNet++, and KITTI-360 datasets.
arXiv Detail & Related papers (2024-03-11T03:28:20Z) - RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z) - Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
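As a rough illustration of the voting step described above, the snippet below fuses per-point labels that are assumed to have already been obtained by back-projecting each 2D model's mask onto the point cloud; the function name and the ignore-label handling are assumptions, not the paper's code.

```python
# Minimal sketch of majority-vote fusion of per-point pseudo labels
# produced by several 2D vision foundation models.
import numpy as np


def fuse_pseudo_labels(per_model_labels: np.ndarray, ignore_label: int = -1) -> np.ndarray:
    """per_model_labels: (num_models, num_points) integer class ids.
    Returns the per-point majority vote; points with no valid vote keep ignore_label."""
    num_models, num_points = per_model_labels.shape
    fused = np.full(num_points, ignore_label, dtype=np.int64)
    for i in range(num_points):
        votes = per_model_labels[:, i]
        votes = votes[votes != ignore_label]
        if votes.size:
            vals, counts = np.unique(votes, return_counts=True)
            fused[i] = vals[counts.argmax()]
    return fused
```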
arXiv Detail & Related papers (2023-11-03T15:41:15Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training [44.790636524264]
Point Prompt Training is a novel framework for multi-dataset synergistic learning in the context of 3D representation learning.
It can overcome the negative transfer associated with synergistic learning and produce generalizable representations.
It achieves state-of-the-art performance on each dataset using a single weight-shared model with supervised multi-dataset training.
arXiv Detail & Related papers (2023-08-18T17:59:57Z) - CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP [19.66617835750012]
Training a 3D scene understanding model requires complicated human annotations.
Vision-language pre-training models (e.g., CLIP) have shown remarkable open-world reasoning properties.
We propose directly transferring CLIP's feature space to a 3D scene understanding model without any form of supervision.
arXiv Detail & Related papers (2023-03-08T17:30:58Z) - ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z) - A Closer Look at Invariances in Self-supervised Pre-training for 3D Vision [0.0]
Self-supervised pre-training for 3D vision has drawn increasing research interest in recent years.
We present a unified framework under which various pre-training methods can be investigated.
We propose a simple but effective method that jointly pre-trains a 3D encoder and a depth map encoder using contrastive learning.
arXiv Detail & Related papers (2022-07-11T16:44:15Z) - On Triangulation as a Form of Self-Supervision for 3D Human Pose Estimation [57.766049538913926]
Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant.
Much of the recent attention has shifted towards semi- and/or weakly-supervised learning.
We propose to impose multi-view geometrical constraints by means of a differentiable triangulation and to use it as a form of self-supervision during training when no labels are available.
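As an illustration of the kind of operator such self-supervision can build on, the snippet below sketches a differentiable DLT triangulation in PyTorch; the function signature and how it would be wired into a training loss are assumptions, not the paper's implementation.

```python
# Rough sketch of differentiable DLT triangulation (SVD-based); gradients
# flow through torch.linalg.svd, so reprojections of the triangulated point
# can serve as a self-supervision signal.
import torch


def triangulate_dlt(points_2d: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    """points_2d: (V, 2) pixel coordinates of one joint in V views.
    proj: (V, 3, 4) camera projection matrices. Returns the (3,) 3D point."""
    rows = []
    for (u, v), P in zip(points_2d, proj):
        # Each view contributes two rows: u * P[2] - P[0] and v * P[2] - P[1]
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = torch.stack(rows)            # (2V, 4)
    _, _, Vh = torch.linalg.svd(A)
    X = Vh[-1]                       # homogeneous solution with smallest singular value
    return X[:3] / X[3]
```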
arXiv Detail & Related papers (2022-03-29T19:11:54Z) - Using Adaptive Gradient for Texture Learning in Single-View 3D Reconstruction [0.0]
Learning-based approaches for 3D model reconstruction have attracted attention owing to their modern applications.
We present a novel sampling algorithm that optimizes the gradient of predicted coordinates based on the variance of the sampling image.
We also adopt the Frechet Inception Distance (FID) to form a loss function in learning, which helps bridge the gap between rendered images and input images.
arXiv Detail & Related papers (2021-04-29T07:52:54Z)