Learning 3D Object Shape and Layout without 3D Supervision
- URL: http://arxiv.org/abs/2206.07028v1
- Date: Tue, 14 Jun 2022 17:49:44 GMT
- Title: Learning 3D Object Shape and Layout without 3D Supervision
- Authors: Georgia Gkioxari, Nikhila Ravi, Justin Johnson
- Abstract summary: A 3D scene consists of a set of objects, each with a shape and a layout giving their position in space.
We propose a method that learns to predict 3D shape and layout for objects without any ground truth shape or layout information.
Our approach outperforms supervised approaches trained on smaller and less diverse datasets.
- Score: 26.575177430506667
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A 3D scene consists of a set of objects, each with a shape and a layout
giving their position in space. Understanding 3D scenes from 2D images is an
important goal, with applications in robotics and graphics. While there have
been recent advances in predicting 3D shape and layout from a single image,
most approaches rely on 3D ground truth for training which is expensive to
collect at scale. We overcome these limitations and propose a method that
learns to predict 3D shape and layout for objects without any ground truth
shape or layout information: instead we rely on multi-view images with 2D
supervision which can more easily be collected at scale. Through extensive
experiments on 3D Warehouse, Hypersim, and ScanNet we demonstrate that our
approach scales to large datasets of realistic images, and compares favorably
to methods relying on 3D ground truth. On Hypersim and ScanNet where reliable
3D ground truth is not available, our approach outperforms supervised
approaches trained on smaller and less diverse datasets.
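To make the 2D-only supervision signal concrete, below is a minimal, hypothetical sketch of the general idea: a predicted mesh is re-rendered into several known viewpoints with a differentiable silhouette renderer (PyTorch3D) and compared against 2D masks. The sphere template, camera poses, placeholder masks, and loss weights are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: supervise a predicted shape with multi-view 2D masks only,
# via differentiable silhouette rendering. Not the paper's architecture; vertex
# offsets on a sphere stand in for the shape/layout prediction networks.
import torch
import torch.nn.functional as F
from pytorch3d.utils import ico_sphere
from pytorch3d.loss import mesh_edge_loss, mesh_laplacian_smoothing
from pytorch3d.renderer import (
    FoVPerspectiveCameras, look_at_view_transform, RasterizationSettings,
    MeshRenderer, MeshRasterizer, SoftSilhouetteShader, BlendParams,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
template = ico_sphere(3, device)  # toy stand-in for a predicted object mesh
offsets = torch.zeros(template.verts_packed().shape, device=device, requires_grad=True)

# Three supervising viewpoints; gt_masks would be real 2D instance masks in those views.
R, T = look_at_view_transform(dist=2.7, elev=10.0, azim=torch.tensor([0.0, 90.0, 180.0]))
cameras = FoVPerspectiveCameras(R=R, T=T, device=device)
renderer = MeshRenderer(
    rasterizer=MeshRasterizer(
        cameras=cameras,
        raster_settings=RasterizationSettings(image_size=128, blur_radius=2e-3, faces_per_pixel=50),
    ),
    shader=SoftSilhouetteShader(blend_params=BlendParams(sigma=1e-4, gamma=1e-4)),
)

gt_masks = torch.rand(3, 128, 128, device=device)  # placeholder for real 2D masks
opt = torch.optim.Adam([offsets], lr=1e-2)
for _ in range(200):
    mesh = template.offset_verts(offsets)
    # Render the same predicted mesh from every supervising viewpoint (alpha channel = silhouette).
    silhouettes = renderer(mesh.extend(3), cameras=cameras)[..., 3]
    loss = F.mse_loss(silhouettes, gt_masks)
    loss = loss + 0.1 * mesh_edge_loss(mesh) + 0.1 * mesh_laplacian_smoothing(mesh)
    opt.zero_grad(); loss.backward(); opt.step()
```

The key property is that gradients flow from the 2D mask loss back to the 3D geometry, so no 3D ground-truth shape or layout is required.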
Related papers
- Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198]
We propose a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment.
Our 3D-VLA exploits the superior ability of current large-scale vision-language models to align the semantics between texts and 2D images.
During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images.
arXiv Detail & Related papers (2023-12-15T09:08:14Z)
- Uni3D: Exploring Unified 3D Representation at Scale [66.26710717073372]
We present Uni3D, a 3D foundation model for exploring unified 3D representation at scale.
Uni3D uses a 2D ViT, pretrained end-to-end, to align 3D point cloud features with image-text aligned features.
We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
arXiv Detail & Related papers (2023-10-10T16:49:21Z)
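The feature-alignment idea in the Uni3D entry above can be illustrated with a small hypothetical sketch: a 3D point-cloud encoder's global embedding is pulled toward frozen image and text embeddings from a CLIP-style model that are already aligned with each other. The toy encoder, cosine objective, and random placeholder features are assumptions for illustration, not Uni3D's actual architecture or loss.

```python
# Hypothetical sketch of aligning a 3D embedding with a pre-aligned image-text space.
# The encoder, placeholder CLIP features, and cosine losses are illustrative only.
import torch
import torch.nn.functional as F

class TinyPointEncoder(torch.nn.Module):
    """Stand-in for the ViT-based 3D encoder: maps (B, N, 3) points to (B, D) embeddings."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(3, 256), torch.nn.ReLU(), torch.nn.Linear(256, dim)
        )
    def forward(self, pts):
        return self.mlp(pts).max(dim=1).values  # per-cloud global feature

encoder = TinyPointEncoder()
points = torch.randn(8, 1024, 3)                      # a batch of point clouds
img_emb = F.normalize(torch.randn(8, 512), dim=-1)    # frozen CLIP image features (placeholder)
txt_emb = F.normalize(torch.randn(8, 512), dim=-1)    # frozen CLIP text features (placeholder)

pc_emb = F.normalize(encoder(points), dim=-1)
# Pull the 3D embedding toward both modalities of the already-aligned 2D space.
loss = (1 - (pc_emb * img_emb).sum(-1)).mean() + (1 - (pc_emb * txt_emb).sum(-1)).mean()
loss.backward()
```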
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose SurroundOcc, a method to predict 3D occupancy from multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z)
- OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation [107.71752592196138]
We propose OmniObject3D, a large-vocabulary 3D object dataset with massive high-quality real-scanned 3D objects.
It comprises 6,000 scanned objects in 190 daily categories, sharing common classes with popular 2D datasets.
Each 3D object is captured with both 2D and 3D sensors, providing textured meshes, point clouds, multiview rendered images, and multiple real-captured videos.
arXiv Detail & Related papers (2023-01-18T18:14:18Z)
- Learning 3D Scene Priors with 2D Supervision [37.79852635415233]
We propose a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth.
Our method represents a 3D scene as a latent vector, from which we can progressively decode a sequence of objects characterized by their class categories.
Experiments on 3D-FRONT and ScanNet show that our method outperforms the state of the art in single-view reconstruction.
arXiv Detail & Related papers (2022-11-25T15:03:32Z)
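As a rough illustration of the "latent vector progressively decoded into a sequence of objects" described in the entry above, the hypothetical sketch below decodes one object (class plus box parameters) per step with a GRU until an end token is produced. The token format, stop criterion, and network sizes are invented for illustration and are not the paper's architecture.

```python
# Hypothetical sketch: autoregressively decode a scene latent into a list of objects.
# Class count, "end" token, and box parameterization are assumptions for illustration.
import torch
import torch.nn.functional as F

NUM_CLASSES, END = 21, 20   # class 20 acts as an "end of scene" token (assumption)

class SceneDecoder(torch.nn.Module):
    def __init__(self, latent_dim=256, hidden=256):
        super().__init__()
        self.rnn = torch.nn.GRUCell(NUM_CLASSES + 7, hidden)
        self.init_h = torch.nn.Linear(latent_dim, hidden)
        self.cls_head = torch.nn.Linear(hidden, NUM_CLASSES)   # object category
        self.box_head = torch.nn.Linear(hidden, 7)             # center, size, yaw

    @torch.no_grad()
    def forward(self, z, max_objects=16):
        h = self.init_h(z)
        token = torch.zeros(z.shape[0], NUM_CLASSES + 7)
        objects = []
        for _ in range(max_objects):
            h = self.rnn(token, h)
            cls_id, box = self.cls_head(h).argmax(-1), self.box_head(h)
            if (cls_id == END).all():
                break  # latent says the scene is complete
            objects.append((cls_id, box))
            # Feed the emitted object back in as the next input token.
            token = torch.cat([F.one_hot(cls_id, NUM_CLASSES).float(), box], dim=-1)
        return objects

objects = SceneDecoder()(torch.randn(1, 256))  # one latent vector -> list of (class, box)
```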
- SAT: 2D Semantics Assisted Training for 3D Visual Grounding [95.84637054325039]
3D visual grounding aims at grounding a natural language description of a 3D scene, usually represented in the form of 3D point clouds, to the targeted object region.
Point clouds are sparse, noisy, and contain limited semantic information compared with 2D images.
We propose 2D Semantics Assisted Training (SAT) that utilizes 2D image semantics in the training stage to ease point-cloud-language joint representation learning.
arXiv Detail & Related papers (2021-05-24T17:58:36Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from a large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
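The first step of the 3D-to-2D distillation recipe above (a 2D network mimicking the features of a frozen, pretrained 3D network at the pixels where 3D points project) can be sketched as follows. The toy networks, random projections, and plain L2 objective are placeholder assumptions; the dimension-normalization and adversarial components are omitted.

```python
# Hypothetical sketch: distill features of a frozen 3D teacher into a 2D student
# at projected pixel locations. Networks and the projection lookup are simplified
# placeholders, not the paper's implementation.
import torch
import torch.nn.functional as F

teacher_3d = torch.nn.Linear(3, 64).eval()            # stand-in for a pretrained 3D network
for p in teacher_3d.parameters():
    p.requires_grad_(False)

student_2d = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)  # stand-in 2D feature head

image = torch.randn(1, 3, 60, 80)                      # RGB input
points = torch.randn(500, 3)                           # 3D points of the same scene
pix = torch.randint(0, 60, (500,)), torch.randint(0, 80, (500,))  # their (row, col) projections (placeholder)

with torch.no_grad():
    f3d = teacher_3d(points)                           # (500, 64) frozen teacher features
f2d = student_2d(image)[0, :, pix[0], pix[1]].t()      # (500, 64) student features at projected pixels

# Distillation: L2 between the 2D student's "simulated 3D" features and the 3D teacher's features.
loss = F.mse_loss(f2d, f3d)
loss.backward()
```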
- Interactive Annotation of 3D Object Geometry using 2D Scribbles [84.51514043814066]
In this paper, we propose an interactive framework for annotating 3D object geometry from point cloud data and RGB imagery.
Our framework targets naive users without artistic or graphics expertise.
arXiv Detail & Related papers (2020-08-24T21:51:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.