Mix3D: Out-of-Context Data Augmentation for 3D Scenes
- URL: http://arxiv.org/abs/2110.02210v1
- Date: Tue, 5 Oct 2021 17:57:45 GMT
- Title: Mix3D: Out-of-Context Data Augmentation for 3D Scenes
- Authors: Alexey Nekrasov, Jonas Schult, Or Litany, Bastian Leibe, Francis Engelmann
- Abstract summary: We present Mix3D, a data augmentation technique for segmenting large-scale 3D scenes.
In experiments, we show that models trained with Mix3D profit from a significant performance boost on indoor (ScanNet, S3DIS) and outdoor datasets.
- Score: 33.939743149673696
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Mix3D, a data augmentation technique for segmenting large-scale 3D
scenes. Since scene context helps reasoning about object semantics, current
works focus on models with large capacity and receptive fields that can fully
capture the global context of an input 3D scene. However, strong contextual
priors can have detrimental implications like mistaking a pedestrian crossing
the street for a car. In this work, we focus on the importance of balancing
global scene context and local geometry, with the goal of generalizing beyond
the contextual priors in the training set. In particular, we propose a "mixing"
technique which creates new training samples by combining two augmented scenes.
By doing so, object instances are implicitly placed into novel out-of-context
environments, making it harder for models to rely on scene context alone and
encouraging them to infer semantics from local structure as well. We perform
detailed analysis to understand the importance of global context, local
structures and the effect of mixing scenes. In experiments, we show that models
trained with Mix3D profit from a significant performance boost on indoor
(ScanNet, S3DIS) and outdoor datasets (SemanticKITTI). Mix3D can be trivially
used with any existing method; for example, trained with Mix3D, MinkowskiNet
outperforms all prior state-of-the-art methods by a significant margin on the
ScanNet test benchmark (78.1 mIoU). Code is available at:
https://nekrasov.dev/mix3d/
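For intuition, the mixing step amounts to taking the union of two independently augmented point clouds and their per-point labels. The sketch below is a minimal, hypothetical version in NumPy (function and argument names such as mix3d_sample and augment are illustrative, not the authors' released implementation):

    import numpy as np

    def mix3d_sample(points_a, labels_a, points_b, labels_b, augment=None):
        """Combine two scenes into one out-of-context training sample.

        points_*: (N, 3) float arrays of xyz coordinates
        labels_*: (N,) integer arrays of per-point semantic labels
        augment:  optional callable applying standard per-scene
                  augmentations (rotation, flipping, scaling, ...)
        """
        if augment is not None:
            points_a, labels_a = augment(points_a, labels_a)
            points_b, labels_b = augment(points_b, labels_b)

        # Center each scene at the origin so the two point clouds overlap
        # and objects end up surrounded by out-of-context geometry.
        points_a = points_a - points_a.mean(axis=0, keepdims=True)
        points_b = points_b - points_b.mean(axis=0, keepdims=True)

        # The mixed sample is simply the union of points and labels;
        # the segmentation loss is then computed on all points as usual.
        mixed_points = np.concatenate([points_a, points_b], axis=0)
        mixed_labels = np.concatenate([labels_a, labels_b], axis=0)
        return mixed_points, mixed_labels

In this view, Mix3D slots into an existing data loader: each training sample is drawn as a pair of scenes and passed through a function like the one above before voxelization or network input.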
Related papers
- OSN: Infinite Representations of Dynamic 3D Scenes from Monocular Videos [7.616167860385134]
It has long been challenging to recover the underlying dynamic 3D scene representations from a monocular RGB video.
We introduce a new framework, called OSN, to learn all plausible 3D scene configurations that match the input video.
Our method demonstrates a clear advantage in learning fine-grained 3D scene geometry.
arXiv Detail & Related papers (2024-07-08T05:03:46Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the largest-ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- DreamScape: 3D Scene Creation via Gaussian Splatting joint Correlation Modeling [23.06464506261766]
We present DreamScape, a method for creating highly consistent 3D scenes solely from textual descriptions.
Our approach involves a 3D Gaussian Guide for scene representation, consisting of semantic primitives (objects) and their spatial transformations.
A progressive scale control is tailored during local object generation, ensuring that objects of different sizes and densities adapt to the scene.
arXiv Detail & Related papers (2024-04-14T12:13:07Z)
- SceneWiz3D: Towards Text-guided 3D Scene Composition [134.71933134180782]
Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets.
We introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text.
arXiv Detail & Related papers (2023-12-13T18:59:30Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- Prompt-guided Scene Generation for 3D Zero-Shot Learning [8.658191774247944]
We propose a prompt-guided 3D scene generation and supervision method that augments 3D data so the network learns better.
First, we merge point clouds of two 3D models in certain ways described by a prompt. The prompt acts like the annotation describing each 3D scene.
We have achieved state-of-the-art ZSL and generalized ZSL performance on synthetic (ModelNet40, ModelNet10) and real-scanned (ScanObjectNN) 3D object datasets.
arXiv Detail & Related papers (2022-09-29T11:24:33Z)
- RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of the synthetic dataset, which consists of CAD object models, to boost the learning on real datasets.
Recent work on 3D pre-training exhibits failure when transferring features learned on synthetic objects to other real-world applications.
In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z)
- Semantic Scene Completion via Integrating Instances and Scene in-the-Loop [73.11401855935726]
Semantic Scene Completion aims at reconstructing a complete 3D scene with precise voxel-wise semantics from a single-view depth or RGBD image.
We present Scene-Instance-Scene Network (SISNet), which takes advantage of both instance- and scene-level semantic information.
Our method is capable of inferring fine-grained shape details as well as nearby objects whose semantic categories are easily mixed up.
arXiv Detail & Related papers (2021-04-08T09:50:30Z)