Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data
- URL: http://arxiv.org/abs/2407.10200v1
- Date: Sun, 14 Jul 2024 13:42:05 GMT
- Title: Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data
- Authors: Tuo Feng, Wenguan Wang, Ruijie Quan, Yi Yang
- Abstract summary: Shape2Scene (S2S) is a novel method that learns representations of large-scale 3D scenes from 3D shape data.
MH-P/V establishes direct paths to high-resolution features that capture deep semantic information across multiple scales.
S2SS amalgamates points from various shapes, creating a random pseudo scene (comprising multiple objects) for training data.
Experiments have demonstrated the transferability of 3D representations learned by MH-P/V across shape-level and scene-level 3D tasks.
- Score: 61.36872381753621
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current 3D self-supervised learning methods for 3D scenes face a data desert issue, resulting from the time-consuming and expensive collection process of 3D scene data. Conversely, 3D shape datasets are easier to collect. Despite this, existing pre-training strategies on shape data offer limited potential for 3D scene understanding due to significant disparities in point quantities. To tackle these challenges, we propose Shape2Scene (S2S), a novel method that learns representations of large-scale 3D scenes from 3D shape data. We first design multiscale and high-resolution backbones for shape- and scene-level 3D tasks, i.e., MH-P (point-based) and MH-V (voxel-based). MH-P/V establishes direct paths to high-resolution features that capture deep semantic information across multiple scales. This property makes them suitable for a wide range of 3D downstream tasks that rely heavily on high-resolution features. We then employ a Shape-to-Scene strategy (S2SS) to amalgamate points from various shapes, creating a random pseudo scene (comprising multiple objects) for training data, mitigating disparities between shapes and scenes. Finally, a point-point contrastive loss (PPC) is applied for the pre-training of MH-P/V. In PPC, the inherent correspondence (i.e., point pairs) is naturally obtained in S2SS. Extensive experiments have demonstrated the transferability of 3D representations learned by MH-P/V across shape-level and scene-level 3D tasks. MH-P achieves notable performance on well-known point cloud datasets (93.8% OA on ScanObjectNN and 87.6% instance mIoU on ShapeNetPart). MH-V also achieves promising performance in 3D semantic segmentation and 3D object detection.
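As a rough illustration of the S2SS and PPC ideas described in the abstract, the sketch below assembles a pseudo scene from several shape point clouds and computes a point-point contrastive loss over corresponding points. The placement strategy, feature dimensions, and temperature are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of the Shape-to-Scene (S2SS) idea and a point-point
# contrastive (PPC) loss, based only on the abstract. Placement strategy,
# feature extractor, and temperature are assumptions.
import torch
import torch.nn.functional as F


def make_pseudo_scene(shapes, spread=4.0):
    """Amalgamate several shape point clouds (each [N, 3]) into one pseudo
    scene by scattering them at random offsets on the ground plane."""
    placed = []
    for pts in shapes:
        offset = torch.zeros(3)
        offset[:2] = (torch.rand(2) - 0.5) * spread  # random XY translation
        placed.append(pts + offset)
    return torch.cat(placed, dim=0)                  # [sum(N_i), 3]


def ppc_loss(feat_a, feat_b, temperature=0.07):
    """Point-point contrastive loss: row i of feat_a and row i of feat_b
    come from the same pseudo-scene point, so the correspondence is free."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    logits = a @ b.t() / temperature                 # [P, P] similarity matrix
    targets = torch.arange(a.size(0))                # positives on the diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    shapes = [torch.rand(256, 3) for _ in range(4)]  # four toy shapes
    scene = make_pseudo_scene(shapes)                # pseudo scene
    feats_view1 = torch.randn(scene.size(0), 64)     # stand-in for MH-P/V features
    feats_view2 = feats_view1 + 0.1 * torch.randn_like(feats_view1)
    print(ppc_loss(feats_view1, feats_view2).item())
```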
Related papers
- SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining [100.23919762298227]
We introduce SceneSplat, the first large-scale 3D indoor scene understanding approach that operates on 3DGS.
We also propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes.
SceneSplat-7K is the first large-scale 3DGS dataset for indoor scenes, comprising 6868 scenes.
arXiv Detail & Related papers (2025-03-23T12:50:25Z) - U3DS$^3$: Unsupervised 3D Semantic Scene Segmentation [19.706172244951116]
This paper presents U3DS$^3$ as a step towards completely unsupervised point cloud segmentation for holistic 3D scenes.
The initial step of our proposed approach involves generating superpoints based on the geometric characteristics of each scene.
Representations are then learned through spatial clustering, followed by iterative training with pseudo-labels derived from the cluster centroids.
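A minimal sketch of the clustering-and-pseudo-label loop summarized above for U3DS$^3$; the superpoint step is omitted, and plain k-means over per-point features stands in for the paper's clustering methodology.

```python
# Rough sketch of the pseudo-labelling loop: cluster per-point features and
# use the nearest cluster centroid as a pseudo-label for the next training
# round. The clustering variant and schedule are assumptions, not the paper's.
import torch


def kmeans(features, k, iters=10):
    """Plain k-means over per-point features [N, D]; returns centroids and labels."""
    centroids = features[torch.randperm(features.size(0))[:k]]
    for _ in range(iters):
        dists = torch.cdist(features, centroids)   # [N, k] distances
        labels = dists.argmin(dim=1)
        for c in range(k):
            mask = labels == c
            if mask.any():
                centroids[c] = features[mask].mean(dim=0)
    return centroids, labels


features = torch.randn(1000, 32)           # stand-in for per-point embeddings
_, pseudo_labels = kmeans(features, k=8)   # pseudo-labels for iterative training
print(pseudo_labels.bincount())
```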
arXiv Detail & Related papers (2023-11-10T12:05:35Z) - Model2Scene: Learning 3D Scene Representation via Contrastive Language-CAD Models Pre-training [105.3421541518582]
Current successful methods for 3D scene perception rely on large-scale annotated point clouds.
We propose Model2Scene, a novel paradigm that learns free 3D scene representation from Computer-Aided Design (CAD) models and languages.
Model2Scene yields impressive label-free 3D object salient detection with an average mAP of 46.08% and 55.49% on the ScanNet and S3DIS datasets, respectively.
arXiv Detail & Related papers (2023-09-29T03:51:26Z) - Learning 3D Scene Priors with 2D Supervision [37.79852635415233]
We propose a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth.
Our method represents a 3D scene as a latent vector, from which we progressively decode a sequence of objects characterized by their class categories.
Experiments on 3D-FRONT and ScanNet show that our method outperforms the state of the art in single-view reconstruction.
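The "latent vector decoded into a sequence of objects" idea above can be illustrated with a toy autoregressive decoder; the GRU cell, layer sizes, and fixed object count below are assumptions, not the paper's architecture.

```python
# Toy autoregressive scene decoder: a scene latent is unrolled into a
# sequence of per-object class predictions. All sizes and the GRU choice
# are assumptions for illustration only.
import torch
import torch.nn as nn


class SceneDecoder(nn.Module):
    def __init__(self, latent_dim=128, num_classes=20, max_objects=8):
        super().__init__()
        self.gru = nn.GRUCell(num_classes, latent_dim)
        self.cls_head = nn.Linear(latent_dim, num_classes)
        self.max_objects = max_objects
        self.num_classes = num_classes

    def forward(self, scene_latent):                      # [B, latent_dim]
        h = scene_latent
        prev = torch.zeros(h.size(0), self.num_classes)   # start token
        outputs = []
        for _ in range(self.max_objects):
            h = self.gru(prev, h)
            logits = self.cls_head(h)                     # per-object class logits
            outputs.append(logits)
            prev = torch.softmax(logits, dim=-1)
        return torch.stack(outputs, dim=1)                # [B, max_objects, num_classes]


decoder = SceneDecoder()
print(decoder(torch.randn(2, 128)).shape)                 # torch.Size([2, 8, 20])
```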
arXiv Detail & Related papers (2022-11-25T15:03:32Z) - Prompt-guided Scene Generation for 3D Zero-Shot Learning [8.658191774247944]
We propose a prompt-guided 3D scene generation and supervision method that augments 3D data so the network learns better.
First, we merge the point clouds of two 3D models in ways described by a prompt; the prompt then acts as the annotation describing the resulting 3D scene.
We have achieved state-of-the-art ZSL and generalized ZSL performance on synthetic (ModelNet40, ModelNet10) and real-scanned (ScanObjectNN) 3D object datasets.
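A toy version of the prompt-guided merge described above: two object point clouds are combined according to a spatial relation named in a prompt, and the prompt doubles as the scene annotation. The relation set and offsets are invented for illustration.

```python
# Prompt-guided merging of two point clouds. The relations and offsets here
# are invented placeholders, not the paper's generation rules.
import numpy as np

RELATIONS = {"left of": np.array([-1.5, 0.0, 0.0]),
             "on top of": np.array([0.0, 0.0, 1.0])}


def merge_with_prompt(pts_a, pts_b, relation="left of"):
    """Place object A relative to object B and return the merged cloud
    plus the prompt text used as the scene-level annotation."""
    merged = np.concatenate([pts_a + RELATIONS[relation], pts_b], axis=0)
    prompt = f"object A is {relation} object B"
    return merged, prompt


a, b = np.random.rand(512, 3), np.random.rand(512, 3)
scene, prompt = merge_with_prompt(a, b, "on top of")
print(scene.shape, "-", prompt)
```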
arXiv Detail & Related papers (2022-09-29T11:24:33Z) - MvDeCor: Multi-view Dense Correspondence Learning for Fine-grained 3D Segmentation [91.6658845016214]
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks.
We render a 3D shape from multiple views, and set up a dense correspondence learning task within the contrastive learning framework.
As a result, the learned 2D representations are view-invariant and geometrically consistent.
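The dense correspondences that MvDeCor exploits come for free when the same 3D shape is rendered from two views; the sketch below projects surface points with two hypothetical pinhole cameras so that matching pixel pairs can serve as contrastive positives. The camera model and parameters are illustrative assumptions.

```python
# Projecting the same surface points into two views gives dense 2D
# correspondences "for free". Pinhole model and camera poses are assumptions.
import numpy as np


def project(points, rotation, translation, focal=200.0, size=224):
    """Pinhole projection of [N, 3] world points into pixel coordinates."""
    cam = points @ rotation.T + translation              # world -> camera
    uv = focal * cam[:, :2] / cam[:, 2:3] + size / 2.0   # perspective divide
    return uv


points = np.random.rand(1024, 3) - 0.5                   # surface samples
R1, t = np.eye(3), np.array([0.0, 0.0, 2.0])
theta = np.pi / 8                                         # second view, rotated
R2 = np.array([[np.cos(theta), 0, np.sin(theta)],
               [0, 1, 0],
               [-np.sin(theta), 0, np.cos(theta)]])
uv1, uv2 = project(points, R1, t), project(points, R2, t)
# Row i of uv1 and uv2 depicts the same surface point: a dense positive pair
# for contrastive learning between the two rendered views.
print(uv1.shape, uv2.shape)
```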
arXiv Detail & Related papers (2022-08-18T00:48:15Z) - Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds [96.9027094562957]
We introduce a spatio-temporal representation learning (STRL) framework, capable of learning from unlabeled 3D point clouds in a self-supervised fashion.
Inspired by how infants learn from visual data in the wild, we explore rich cues derived from the 3D data.
STRL takes two temporally related frames from a 3D point cloud sequence as input, transforms them with spatial data augmentation, and learns an invariant representation in a self-supervised manner.
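A minimal sketch of the spatio-temporal setup summarized above: two temporally adjacent frames are spatially augmented and their global embeddings pulled together. The encoder and the augmentation below are placeholders, not the STRL architecture.

```python
# Two adjacent frames, one simple spatial augmentation, and an invariance
# objective on pooled embeddings. Encoder and augmentation are placeholders.
import math
import torch
import torch.nn.functional as F


def random_rotate_z(points):
    """Random rotation about the gravity axis as a simple spatial augmentation."""
    theta = torch.rand(1).item() * 2 * math.pi
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ rot.T


sequence = [torch.rand(2048, 3) for _ in range(10)]       # toy point cloud sequence
i = torch.randint(len(sequence) - 1, (1,)).item()
frame_a = random_rotate_z(sequence[i])                    # query frame, augmented
frame_b = random_rotate_z(sequence[i + 1])                # temporally adjacent key frame

encoder = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 128))
emb_a = encoder(frame_a).max(dim=0).values                # crude global pooling
emb_b = encoder(frame_b).max(dim=0).values
invariance_loss = 1 - F.cosine_similarity(emb_a, emb_b, dim=0)
print(invariance_loss.item())
```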
arXiv Detail & Related papers (2021-09-01T04:17:11Z) - Exploring Deep 3D Spatial Encodings for Large-Scale 3D Scene Understanding [19.134536179555102]
We propose an alternative approach that overcomes the limitations of CNN-based methods by encoding the spatial features of raw 3D point clouds into undirected graph models.
The proposed method achieves accuracy on par with the state of the art, with improved training time and model stability, indicating strong potential for further research.
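A small sketch of turning a raw point cloud into an undirected graph, the general idea behind the approach above; the k-nearest-neighbour construction and the value of k are assumptions rather than the paper's exact graph model.

```python
# Build an undirected k-NN graph over a raw point cloud. The k-NN rule and
# k=8 are assumptions for illustration, not the paper's construction.
import torch


def knn_graph(points, k=8):
    """Return undirected edges [E, 2] linking each point to its k nearest neighbours."""
    dists = torch.cdist(points, points)                   # [N, N] pairwise distances
    dists.fill_diagonal_(float("inf"))                    # ignore self-loops
    nbrs = dists.topk(k, largest=False).indices           # [N, k] neighbour indices
    src = torch.arange(points.size(0)).repeat_interleave(k)
    edges = torch.stack([src, nbrs.reshape(-1)], dim=1)
    return torch.unique(torch.sort(edges, dim=1).values, dim=0)  # undirected, deduplicated


pts = torch.rand(500, 3)
print(knn_graph(pts).shape)    # roughly [500 * 8, 2] minus duplicate edges
```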
arXiv Detail & Related papers (2020-11-29T12:56:19Z) - Cylinder3D: An Effective 3D Framework for Driving-scene LiDAR Semantic Segmentation [87.54570024320354]
State-of-the-art methods for large-scale driving-scene LiDAR semantic segmentation often project and process the point clouds in the 2D space.
A straightforward solution to tackle the issue of 3D-to-2D projection is to keep the 3D representation and process the points in the 3D space.
We develop a framework based on 3D cylinder partition and 3D cylinder convolution, termed Cylinder3D, which exploits the 3D topological relations and structures of driving-scene point clouds.
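A cylindrical partition of the kind described above can be sketched by binning points over radius, azimuth, and height instead of a regular Cartesian grid; the bin counts and ranges below are illustrative, and the paper's 3D cylinder convolutions are not shown.

```python
# Assign LiDAR points to cylindrical voxels (radius, azimuth, height).
# Bin counts and ranges are illustrative assumptions.
import numpy as np


def cylindrical_voxel_ids(points, r_bins=32, a_bins=64, z_bins=16,
                          r_max=50.0, z_min=-3.0, z_max=3.0):
    """Map [N, 3] points (x, y, z) to integer (radius, azimuth, height) voxel indices."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x ** 2 + y ** 2)
    a = np.arctan2(y, x) + np.pi                          # azimuth in [0, 2*pi]
    r_id = np.clip((r / r_max * r_bins).astype(int), 0, r_bins - 1)
    a_id = np.clip((a / (2 * np.pi) * a_bins).astype(int), 0, a_bins - 1)
    z_id = np.clip(((z - z_min) / (z_max - z_min) * z_bins).astype(int), 0, z_bins - 1)
    return np.stack([r_id, a_id, z_id], axis=1)


pts = np.random.randn(10000, 3) * np.array([20.0, 20.0, 1.0])
print(np.unique(cylindrical_voxel_ids(pts), axis=0).shape)   # occupied cylinder cells
```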
arXiv Detail & Related papers (2020-08-04T13:56:19Z) - DOPS: Learning to Detect 3D Objects and Predict their 3D Shapes [54.239416488865565]
We propose a fast single-stage 3D object detection method for LiDAR data.
The core novelty of our method is a fast, single-pass architecture that both detects objects in 3D and estimates their shapes.
We find that our proposed method surpasses the previous state of the art by 5% on object detection in ScanNet scenes and leads by 3.4% on the Open dataset.
arXiv Detail & Related papers (2020-04-02T17:48:50Z)