Related papers: ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

URL: http://arxiv.org/abs/2410.13924v1
Date: Thu, 17 Oct 2024 14:44:35 GMT
Title: ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding
Authors: Guangda Ji, Silvan Weder, Francis Engelmann, Marc Pollefeys, Hermann Blum
Abstract summary: We introduce ARKit LabelMaker, the first large-scale, real-world 3D dataset with dense semantic annotations. We also push forward the state-of-the-art performance on the ScanNet and ScanNet200 datasets with prevalent 3D semantic segmentation models.
Score: 51.509115746992165
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The performance of neural networks scales with both their size and the amount of data they have been trained on. This has been shown in both language and image generation. However, this requires scaling-friendly network architectures as well as large-scale datasets. Even though scaling-friendly architectures like transformers have emerged for 3D vision tasks, the GPT-moment of 3D vision remains distant due to the lack of training data. In this paper, we introduce ARKit LabelMaker, the first large-scale, real-world 3D dataset with dense semantic annotations. Specifically, we complement the ARKitScenes dataset with dense semantic annotations that are automatically generated at scale. To this end, we extend LabelMaker, a recent automatic annotation pipeline, to serve the needs of large-scale pre-training. This involves extending the pipeline with cutting-edge segmentation models as well as making it robust to the challenges of large-scale processing. Further, we push forward the state-of-the-art performance on the ScanNet and ScanNet200 datasets with prevalent 3D semantic segmentation models, demonstrating the efficacy of our generated dataset.

Related papers

Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation [92.17176311351469]
We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs.
arXiv Detail & Related papers (2025-02-04T18:18:50Z)
Open-Vocabulary High-Resolution 3D (OVHR3D) Data Segmentation and Annotation Framework [1.1280113914145702]
This research aims to design and develop a comprehensive and efficient framework for 3D segmentation tasks. The framework integrates Grounding DINO and the Segment Anything Model, augmented by an enhancement in 2D image rendering via 3D mesh.
arXiv Detail & Related papers (2024-12-09T07:39:39Z)
Open-Vocabulary Octree-Graph for 3D Scene Understanding [54.11828083068082]
Octree-Graph is a novel scene representation for open-vocabulary 3D scene understanding. An adaptive-octree structure is developed that stores semantics and depicts the occupancy of an object adjustably according to its shape.
arXiv Detail & Related papers (2024-11-25T10:14:10Z)
LABELMAKER: Automatic Semantic Label Generation from RGB-D Trajectories [59.14011485494713]
This work introduces a fully automated 2D/3D labeling framework that can generate labels for RGB-D scans at an equal (or better) level of accuracy. We demonstrate the effectiveness of our LabelMaker pipeline by generating significantly better labels for the ScanNet datasets and automatically labelling the previously unlabeled ARKitScenes dataset.
arXiv Detail & Related papers (2023-11-20T20:40:24Z)
DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields [68.94868475824575]
This paper introduces a novel approach capable of generating infinite, high-quality 3D-consistent 2D annotations alongside 3D point cloud segmentations. We leverage the strong semantic prior within a 3D generative model to train a semantic decoder. Once trained, the decoder efficiently generalizes across the latent space, enabling the generation of infinite data.
arXiv Detail & Related papers (2023-11-18T21:58:28Z)
Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task. Our approach involves making initial predictions of 2D semantic masks using different large vision models. To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
UniG3D: A Unified 3D Object Generation Dataset [75.49544172927749]
UniG3D is a unified 3D object generation dataset constructed by employing a universal data transformation pipeline on ShapeNet datasets. This pipeline converts each raw 3D model into comprehensive multi-modal data representation. The selection of data sources for our dataset is based on their scale and quality.
arXiv Detail & Related papers (2023-06-19T07:03:45Z)
Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving [34.368848580725576]
We develop a label generation pipeline that produces dense, visibility-aware labels for any given scene. This pipeline comprises three stages: voxel densification, reasoning, and image-guided voxel refinement. We propose a new model, dubbed Coarse-to-Fine Occupancy (CTF-Occ) network, which demonstrates superior performance on the Occ3D benchmarks.
arXiv Detail & Related papers (2023-04-27T17:40:08Z)
Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding [40.68012530554327]
We introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity. A series of extensive ablation studies further validate the scalability, generality, and superior performance enabled by our approach.
arXiv Detail & Related papers (2023-04-14T02:49:08Z)
Semi-Weakly Supervised Object Kinematic Motion Prediction [56.282759127180306]
Given a 3D object, kinematic motion prediction aims to identify the mobile parts as well as the corresponding motion parameters. We propose a graph neural network to learn the map between hierarchical part-level segmentation and mobile parts parameters. The network predictions yield a large scale of 3D objects with pseudo labeled mobility information.
arXiv Detail & Related papers (2023-03-31T02:37:36Z)
Learning 3D Human Pose Estimation from Dozens of Datasets using a Geometry-Aware Autoencoder to Bridge Between Skeleton Formats [80.12253291709673]
We propose a novel affine-combining autoencoder (ACAE) method to perform dimensionality reduction on the number of landmarks. Our approach scales to an extreme multi-dataset regime, where we use 28 3D human pose datasets to supervise one model.
arXiv Detail & Related papers (2022-12-29T22:22:49Z)
H3D: Benchmark on Semantic Segmentation of High-Resolution 3D Point Clouds and textured Meshes from UAV LiDAR and Multi-View-Stereo [4.263987603222371]
This paper introduces a 3D dataset which is unique in three ways. It depicts the village of Hessigheim (Germany), henceforth referred to as H3D. It is designed both to promote research in the field of 3D data analysis and to evaluate and rank emerging approaches.
arXiv Detail & Related papers (2021-02-10T09:33:48Z)
Rapid Pose Label Generation through Sparse Representation of Unknown Objects [7.32172860877574]
This work presents an approach for rapidly generating real-world, pose-annotated RGB-D data for unknown objects. We first source minimalistic labelings of an ordered set of arbitrarily chosen keypoints over a set of RGB-D videos. By solving an optimization problem, we combine these labels under a world frame to recover a sparse, keypoint-based representation of the object.
arXiv Detail & Related papers (2020-11-07T15:14:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.