Representation Learning for Point Cloud Understanding
- URL: http://arxiv.org/abs/2512.06058v1
- Date: Fri, 05 Dec 2025 17:09:49 GMT
- Title: Representation Learning for Point Cloud Understanding
- Authors: Siming Yan,
- Abstract summary: dissertation focuses on three main areas: supervised representation learning for point cloud primitive segmentation, self-supervised learning methods, and transfer learning from 2D to 3D.<n>Our approach integrates pre-trained 2D models to support 3D network training, significantly improves 3D understanding without merely transforming 2D data.
- Score: 4.304883944435261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement of technology, 3D data acquisition and utilization have become increasingly prevalent across various fields, including computer vision, robotics, and geospatial analysis. 3D data, captured through methods such as 3D scanners, LiDARs, and RGB-D cameras, provides rich geometric, shape, and scale information. When combined with 2D images, 3D data offers machines a comprehensive understanding of their environment, benefiting applications like autonomous driving, robotics, remote sensing, and medical treatment. This dissertation focuses on three main areas: supervised representation learning for point cloud primitive segmentation, self-supervised learning methods, and transfer learning from 2D to 3D. Our approach, which integrates pre-trained 2D models to support 3D network training, significantly improves 3D understanding without merely transforming 2D data. Extensive experiments validate the effectiveness of our methods, showcasing their potential to advance point cloud representation learning by effectively integrating 2D knowledge.
Related papers
- Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting [64.64738535860351]
We present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations.<n>Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding.<n>By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence.
arXiv Detail & Related papers (2025-07-24T14:53:26Z) - TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP [52.79100775328595]
3D visual grounding allows an embodied agent to understand visual information in real-world 3D environments based on human instructions.<n>Existing 3D visual grounding methods rely on separate encoders for different modalities.<n>We propose a unified 2D pre-trained multi-modal network to process all three modalities.
arXiv Detail & Related papers (2025-07-20T10:28:06Z) - CL3R: 3D Reconstruction and Contrastive Learning for Enhanced Robotic Manipulation Representations [19.71090711790973]
We propose a novel 3D pre-training framework designed to enhance robotic manipulation policies.<n>Our method integrates both spatial awareness and semantic understanding by employing a point cloud Masked Autoencoder.<n>We mitigate camera view ambiguity and improve generalization, enabling robust perception from novel viewpoints at test time.
arXiv Detail & Related papers (2025-07-11T02:16:32Z) - Unifying 2D and 3D Vision-Language Understanding [85.84054120018625]
We introduce UniVLG, a unified architecture for 2D and 3D vision-language learning.<n>UniVLG bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
arXiv Detail & Related papers (2025-03-13T17:56:22Z) - SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z) - Point Cloud Self-supervised Learning via 3D to Multi-view Masked Learner [19.908670991088556]
We introduce a 3D to multi-view autoencoder that reconstructs point clouds and multi-view images from 3D and projected 2D features.<n>A novel two-stage self-training strategy is proposed to align 2D and 3D representations.<n>Our method outperforms state-of-the-art counterparts across various downstream tasks, including 3D classification, part segmentation, and object detection.
arXiv Detail & Related papers (2023-11-17T22:10:03Z) - Learning from 2D: Pixel-to-Point Knowledge Transfer for 3D Pretraining [21.878815180924832]
We present a novel 3D pretraining method by leveraging 2D networks learned from rich 2D datasets.
Our experiments show that the 3D models pretrained with 2D knowledge boost the performances across various real-world 3D downstream tasks.
arXiv Detail & Related papers (2021-04-10T05:40:42Z) - 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during the training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z) - Self-supervised Feature Learning by Cross-modality and Cross-view
Correspondences [32.01548991331616]
This paper presents a novel self-supervised learning approach to learn both 2D image features and 3D point cloud features.
It exploits cross-modality and cross-view correspondences without using any annotated human labels.
The effectiveness of the learned 2D and 3D features is evaluated by transferring them on five different tasks.
arXiv Detail & Related papers (2020-04-13T02:57:25Z) - 2.75D: Boosting learning by representing 3D Medical imaging to 2D
features for small data [54.223614679807994]
3D convolutional neural networks (CNNs) have started to show superior performance to 2D CNNs in numerous deep learning tasks.
Applying transfer learning on 3D CNN is challenging due to a lack of publicly available pre-trained 3D models.
In this work, we proposed a novel 2D strategical representation of volumetric data, namely 2.75D.
As a result, 2D CNN networks can also be used to learn volumetric information.
arXiv Detail & Related papers (2020-02-11T08:24:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.