Multiview Compressive Coding for 3D Reconstruction
- URL: http://arxiv.org/abs/2301.08247v1
- Date: Thu, 19 Jan 2023 18:59:52 GMT
- Title: Multiview Compressive Coding for 3D Reconstruction
- Authors: Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer,
Georgia Gkioxari
- Abstract summary: We introduce a simple framework that operates on 3D points of single objects or whole scenes.
Our model, Multiview Compressive Coding, learns to compress the input appearance and geometry to predict the 3D structure.
- Score: 77.95706553743626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A central goal of visual recognition is to understand objects and scenes from
a single image. 2D recognition has witnessed tremendous progress thanks to
large-scale learning and general-purpose representations. Comparatively, 3D
poses new challenges stemming from occlusions not depicted in the image. Prior
works try to overcome these by inferring from multiple views or by relying on scarce
CAD models and category-specific priors, which hinder scaling to novel settings.
In this work, we explore single-view 3D reconstruction by learning
generalizable representations inspired by advances in self-supervised learning.
We introduce a simple framework that operates on 3D points of single objects or
whole scenes coupled with category-agnostic large-scale training from diverse
RGB-D videos. Our model, Multiview Compressive Coding (MCC), learns to compress
the input appearance and geometry to predict the 3D structure by querying a
3D-aware decoder. MCC's generality and efficiency allow it to learn from
large-scale and diverse data sources with strong generalization to novel
objects imagined by DALL$\cdot$E 2 or captured in-the-wild with an iPhone.
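The abstract describes MCC's core mechanism at a high level: compress the input image appearance and its unprojected 3D points into a set of tokens, then predict the full 3D structure by querying a 3D-aware decoder at arbitrary 3D points. The PyTorch snippet below is a minimal, hypothetical sketch of that query-based decoding pattern; the module names, dimensions, attention layout, and occupancy/color heads are illustrative assumptions, not the authors' implementation.

```python
# Minimal, assumption-laden sketch of MCC-style query-based 3D decoding.
# The encoder that would produce `scene_tokens` (from RGB pixels plus
# unprojected depth points) is omitted; names and sizes are illustrative only.
import torch
import torch.nn as nn

class QueryDecoder3D(nn.Module):
    """Cross-attends 3D query points to a compressed appearance+geometry encoding."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.query_embed = nn.Linear(3, dim)        # embed (x, y, z) query coordinates
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.occupancy_head = nn.Linear(dim, 1)     # occupied vs. empty logit per query
        self.color_head = nn.Linear(dim, 3)         # RGB prediction per query

    def forward(self, queries, scene_tokens):
        q = self.query_embed(queries)               # (B, Nq, dim)
        q, _ = self.attn(q, scene_tokens, scene_tokens)
        return self.occupancy_head(q), self.color_head(q)

# Toy usage with random stand-ins for the encoder output.
B, Ntok, Nq, dim = 2, 196, 1024, 256
scene_tokens = torch.randn(B, Ntok, dim)
query_points = torch.rand(B, Nq, 3) * 2 - 1         # queries inside a [-1, 1]^3 box
decoder = QueryDecoder3D(dim=dim)
occ_logits, rgb = decoder(query_points, scene_tokens)
print(occ_logits.shape, rgb.shape)                  # (2, 1024, 1) and (2, 1024, 3)
```

At inference time, occupancy logits evaluated at a dense grid of query points can be thresholded to recover a point cloud, which fits the category-agnostic training described in the abstract.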
Related papers
- 3D Feature Distillation with Object-Centric Priors [9.626027459292926]
2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images.
Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific or focus on indoor room scan data.
We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency.
arXiv Detail & Related papers (2024-06-26T20:16:49Z)
- Deep Models for Multi-View 3D Object Recognition: A Review [16.500711021549947]
Multi-view 3D representations for object recognition have thus far demonstrated the most promising results for achieving state-of-the-art performance.
This review paper comprehensively covers recent progress in multi-view 3D object recognition methods for 3D classification and retrieval tasks.
arXiv Detail & Related papers (2024-04-23T16:54:31Z)
- 3D-LFM: Lifting Foundation Model [29.48835001900286]
Deep learning has expanded our capability to reconstruct a wide range of object classes.
Our approach harnesses the inherent permutation equivariance of transformers to manage a varying number of points per 3D data instance.
We demonstrate state-of-the-art performance across 2D-3D lifting task benchmarks.
arXiv Detail & Related papers (2023-12-19T06:38:18Z)
- NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization [80.3424839706698]
We present NeurOCS, a framework that uses instance masks and 3D boxes as input to learn 3D object shapes by means of differentiable rendering.
Our approach rests on insights in learning a category-level shape prior directly from real driving scenes.
We make critical design choices to learn object coordinates more effectively from an object-centric view.
arXiv Detail & Related papers (2023-05-28T16:18:41Z)
- Object Scene Representation Transformer [56.40544849442227]
We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis.
OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods.
It is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.
arXiv Detail & Related papers (2022-06-14T15:40:47Z)
- Unsupervised Learning of Visual 3D Keypoints for Control [104.92063943162896]
Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations.
We propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner.
These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space.
arXiv Detail & Related papers (2021-06-14T17:59:59Z)
- Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations [61.870882736758624]
We propose a novel self-supervised paradigm to learn Multi-View Transformation Equivariant Representations (MV-TER).
Specifically, we perform a 3D transformation on a 3D object and obtain multiple views before and after the transformation via projection.
Then, we self-train a representation to capture the intrinsic 3D object structure by decoding the 3D transformation parameters from the fused features of the views before and after the transformation; a minimal illustrative sketch of this training signal appears after this list.
arXiv Detail & Related papers (2021-03-01T06:24:17Z)
- Learning to Reconstruct and Segment 3D Objects [4.709764624933227]
We aim to understand scenes and the objects within them by learning general and robust representations using deep neural networks.
This thesis makes three core contributions, ranging from object-level 3D shape estimation from single or multiple views to scene-level semantic understanding.
arXiv Detail & Related papers (2020-10-19T15:09:04Z)
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning Multi-Object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)
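The MV-TER entry above describes its self-supervision concretely enough to illustrate: encode views of an object rendered before and after a known 3D transformation, fuse the features of each set of views, and regress the transformation parameters. The snippet below is a hypothetical toy version of that signal; the encoder, mean-pooling fusion, and three-angle rotation parameterization are assumptions, not the authors' code.

```python
# Hypothetical sketch of an MV-TER-style self-supervision signal: encode views
# rendered before and after a known 3D transformation, fuse them, and regress
# the transformation parameters. All architecture details here are assumptions.
import math
import torch
import torch.nn as nn

class ViewEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, views):                        # views: (B, V, 3, H, W)
        B, V = views.shape[:2]
        feats = self.net(views.flatten(0, 1))        # (B * V, dim)
        return feats.view(B, V, -1).mean(dim=1)      # fuse the V views by averaging

encoder = ViewEncoder()
decode_transform = nn.Linear(2 * 128, 3)             # regress 3 rotation angles

# Toy batch: V views "rendered" before and after a random rotation (random tensors here).
B, V, H, W = 4, 6, 64, 64
views_before = torch.randn(B, V, 3, H, W)
views_after = torch.randn(B, V, 3, H, W)
true_angles = torch.rand(B, 3) * 2 * math.pi

fused = torch.cat([encoder(views_before), encoder(views_after)], dim=-1)
pred_angles = decode_transform(fused)
loss = nn.functional.mse_loss(pred_angles, true_angles)   # transformation-decoding loss
loss.backward()
print(float(loss))
```

Minimizing such a loss encourages the fused multi-view features to encode the applied 3D transformation, which is the equivariance property the abstract refers to.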
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and accepts no responsibility for any consequences of its use.