Learning to Reconstruct and Segment 3D Objects
- URL: http://arxiv.org/abs/2010.09582v1
- Date: Mon, 19 Oct 2020 15:09:04 GMT
- Title: Learning to Reconstruct and Segment 3D Objects
- Authors: Bo Yang
- Abstract summary: We aim to understand scenes and the objects within them by learning general and robust representations using deep neural networks.
This thesis makes three core contributions, ranging from object-level 3D shape estimation from single or multiple views to scene-level semantic understanding.
- Score: 4.709764624933227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Endowing machines with the ability to perceive the real world in three dimensions, as humans do, is a fundamental and long-standing topic in Artificial Intelligence. Given different types of visual inputs, such as images or point clouds acquired by 2D/3D sensors, one important goal is to understand the geometric structure and semantics of the 3D environment. Traditional approaches usually leverage hand-crafted features to estimate the shape and semantics of objects or scenes. However, they generalize poorly to novel objects and scenarios, and struggle to overcome critical issues caused by visual occlusions. By contrast, we aim to understand scenes and the objects within them by learning general and robust representations with deep neural networks trained on large-scale real-world 3D data. To achieve these aims, this thesis makes three core contributions, ranging from object-level 3D shape estimation from single or multiple views to scene-level semantic understanding.
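As a rough illustration of the learning-based approach described in the abstract, the sketch below shows a minimal PointNet-style per-point segmentation network in PyTorch. The architecture, layer sizes, and class count are assumptions chosen for brevity; it is not one of the thesis's actual models.

```python
# Illustrative only: a minimal PointNet-style per-point segmentation network.
# Layer sizes, names, and the number of classes are assumptions, not the
# thesis's actual architecture.
import torch
import torch.nn as nn


class TinyPointSegNet(nn.Module):
    def __init__(self, num_classes: int = 13):
        super().__init__()
        # Shared per-point MLP that lifts raw xyz coordinates to local features.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        # Per-point classifier over concatenated local + global features.
        self.head = nn.Sequential(
            nn.Linear(128 + 128, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3)
        local_feat = self.point_mlp(points)                    # (B, N, 128)
        global_feat = local_feat.max(dim=1, keepdim=True)[0]   # (B, 1, 128), symmetric pooling
        global_feat = global_feat.expand_as(local_feat)        # broadcast to every point
        fused = torch.cat([local_feat, global_feat], dim=-1)   # (B, N, 256)
        return self.head(fused)                                # (B, N, num_classes) per-point logits


if __name__ == "__main__":
    model = TinyPointSegNet()
    cloud = torch.randn(2, 1024, 3)   # two random point clouds
    print(model(cloud).shape)         # torch.Size([2, 1024, 13])
```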
Related papers
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z) - Learning 3D object-centric representation through prediction [12.008668555280668]
We develop a novel network architecture that learns to 1) segment objects from discrete images, 2) infer their 3D locations, and 3) perceive depth.
The core idea is to treat objects as latent causes of visual input, which the brain uses to make efficient predictions of future scenes.
arXiv Detail & Related papers (2024-03-06T14:19:11Z) - NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization [80.3424839706698]
We present NeurOCS, a framework that uses instance masks and 3D boxes as input to learn 3D object shapes by means of differentiable rendering.
Our approach rests on insights in learning a category-level shape prior directly from real driving scenes.
We make critical design choices to learn object coordinates more effectively from an object-centric view.
arXiv Detail & Related papers (2023-05-28T16:18:41Z) - Multiview Compressive Coding for 3D Reconstruction [77.95706553743626]
We introduce a simple framework that operates on 3D points of single objects or whole scenes.
Our model, Multiview Compressive Coding, learns to compress the input appearance and geometry to predict the 3D structure.
arXiv Detail & Related papers (2023-01-19T18:59:52Z) - 3D Concept Grounding on Neural Fields [99.33215488324238]
Existing visual reasoning approaches typically utilize supervised methods to extract 2D segmentation masks on which concepts are grounded.
Humans are capable of grounding concepts on the underlying 3D representation of images.
We propose to leverage the continuous, differentiable nature of neural fields to segment and learn concepts (a generic coordinate-field sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-07-13T17:59:33Z) - Object Scene Representation Transformer [56.40544849442227]
We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis.
OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods.
It is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.
arXiv Detail & Related papers (2022-06-14T15:40:47Z) - LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z) - Disentangling 3D Prototypical Networks For Few-Shot Concept Learning [29.02523358573336]
We present neural architectures that disentangle RGB-D images into objects' shapes and styles and a map of the background scene.
Our networks incorporate architectural biases that reflect the image formation process, 3D geometry of the world scene, and shape-style interplay.
arXiv Detail & Related papers (2020-11-06T14:08:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.