RUST: Latent Neural Scene Representations from Unposed Imagery
- URL: http://arxiv.org/abs/2211.14306v2
- Date: Fri, 24 Mar 2023 16:56:25 GMT
- Title: RUST: Latent Neural Scene Representations from Unposed Imagery
- Authors: Mehdi S. M. Sajjadi, Aravindh Mahendran, Thomas Kipf, Etienne Pot,
Daniel Duckworth, Mario Lucic, Klaus Greff
- Abstract summary: Inferring structure of 3D scenes from 2D observations is a fundamental challenge in computer vision.
Recently popularized approaches based on neural scene representations have achieved tremendous impact.
RUST (Really Unposed Scene representation Transformer) is a pose-free approach to novel view synthesis trained on RGB images alone.
- Score: 21.433079925439234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inferring the structure of 3D scenes from 2D observations is a fundamental
challenge in computer vision. Recently popularized approaches based on neural
scene representations have achieved tremendous impact and have been applied
across a variety of applications. One of the major remaining challenges in this
space is training a single model which can provide latent representations which
effectively generalize beyond a single scene. Scene Representation Transformer
(SRT) has shown promise in this direction, but scaling it to a larger set of
diverse scenes is challenging and necessitates accurately posed ground truth
data. To address this problem, we propose RUST (Really Unposed Scene
representation Transformer), a pose-free approach to novel view synthesis
trained on RGB images alone. Our main insight is that one can train a Pose
Encoder that peeks at the target image and learns a latent pose embedding which
is used by the decoder for view synthesis. We perform an empirical
investigation into the learned latent pose structure and show that it allows
meaningful test-time camera transformations and accurate explicit pose
readouts. Perhaps surprisingly, RUST achieves similar quality as methods which
have access to perfect camera pose, thereby unlocking the potential for
large-scale training of amortized neural scene representations.
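To make the central idea concrete, the sketch below illustrates one plausible way such a Pose Encoder and latent-pose-conditioned decoder could be wired together. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the module names, feature dimensions, attention layout, and the size of the latent pose vector are all assumptions.

```python
# Illustrative sketch only (not the authors' code): a Pose Encoder infers a
# low-dimensional latent pose from a crop of the target image, and the decoder
# consumes that latent pose instead of an explicit camera pose.
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    def __init__(self, feat_dim=768, latent_pose_dim=8):  # dimensions are assumptions
        super().__init__()
        # Patch-embed a crop of the target image.
        self.patchify = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        # Cross-attend from target-image patches to the scene representation.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        # Project the pooled feature to a small latent pose vector.
        self.to_pose = nn.Linear(feat_dim, latent_pose_dim)

    def forward(self, target_crop, scene_latent):
        # target_crop: (B, 3, H, W); scene_latent: (B, N_tokens, feat_dim)
        tokens = self.patchify(target_crop).flatten(2).transpose(1, 2)  # (B, P, feat_dim)
        attended, _ = self.cross_attn(tokens, scene_latent, scene_latent)
        pooled = attended.mean(dim=1)          # (B, feat_dim)
        return self.to_pose(pooled)            # (B, latent_pose_dim)

class LatentPoseDecoder(nn.Module):
    def __init__(self, feat_dim=768, latent_pose_dim=8):
        super().__init__()
        # Query = latent pose concatenated with normalized pixel coordinates.
        self.query_proj = nn.Linear(latent_pose_dim + 2, feat_dim)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.to_rgb = nn.Linear(feat_dim, 3)

    def forward(self, latent_pose, pixel_coords, scene_latent):
        # pixel_coords: (B, Q, 2) normalized image coordinates to render.
        q = torch.cat([latent_pose.unsqueeze(1).expand(-1, pixel_coords.shape[1], -1),
                       pixel_coords], dim=-1)
        q = self.query_proj(q)
        out, _ = self.cross_attn(q, scene_latent, scene_latent)
        return self.to_rgb(out)                # (B, Q, 3) predicted colors
```

Under these assumptions, training would sample pixels from a target view, infer the latent pose from a crop of that same target view, decode colors, and minimize a photometric loss; since no ground-truth camera pose enters the pipeline, only RGB supervision is required.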
Related papers
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- Learning Robust Multi-Scale Representation for Neural Radiance Fields from Unposed Images [65.41966114373373]
We present an improved solution to the neural image-based rendering problem in computer vision.
The proposed approach could synthesize a realistic image of the scene from a novel viewpoint at test time.
arXiv Detail & Related papers (2023-11-08T08:18:23Z)
- FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow [26.528667940013598]
Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning.
A key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion.
We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass.
arXiv Detail & Related papers (2023-05-31T20:58:46Z)
- One-Shot Neural Fields for 3D Object Understanding [112.32255680399399]
We present a unified and compact scene representation for robotics.
Each object in the scene is depicted by a latent code capturing geometry and appearance.
This representation can be decoded for various tasks such as novel view rendering, 3D reconstruction, and stable grasp prediction.
arXiv Detail & Related papers (2022-10-21T17:33:14Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering (a generic volume-rendering sketch follows this list).
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
- Real-Time Neural Character Rendering with Pose-Guided Multiplane Images [75.62730144924566]
We propose pose-guided multiplane image (MPI) synthesis, which can render an animatable character in real scenes with photorealistic quality.
We use a portable camera rig to capture the multi-view images along with the driving signal for the moving subject.
arXiv Detail & Related papers (2022-04-25T17:51:38Z)
- ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers [34.4824364161812]
Novel view synthesis is a problem where we are given only a few context views sparsely covering a scene or an object.
The goal is to predict novel viewpoints in the scene, which requires learning priors.
We propose a 2D-only method that maps multiple context views and a query pose to a new image in a single pass of a neural network.
arXiv Detail & Related papers (2022-03-18T21:08:23Z)
- Domain Adaptation of Networks for Camera Pose Estimation: Learning Camera Pose Estimation Without Pose Labels [8.409695277909421]
One of the key criticisms of deep learning is that large amounts of expensive and difficult-to-acquire training data are required to train models.
DANCE enables the training of models without access to any labels on the target task: it renders labeled synthetic images from the 3D model and bridges the inevitable domain gap between synthetic and real images.
arXiv Detail & Related papers (2021-11-29T17:45:38Z)
- GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields [45.21191307444531]
Deep generative models allow for photorealistic image synthesis at high resolutions.
But for many applications, this is not enough: content creation also needs to be controllable.
Our key hypothesis is that incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis.
arXiv Detail & Related papers (2020-11-24T14:14:15Z)
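The Vision Transformer entry above mentions an MLP conditioned on a learned 3D representation that "performs volume rendering". The snippet below shows the generic NeRF-style compositing step such decoders typically rely on; it is the standard quadrature formula, not code from any of the papers listed here, and the tensor shapes and epsilon value are illustrative.

```python
# Standard NeRF-style volume rendering quadrature (generic formula, not tied
# to a specific paper above): composite per-sample densities and colors along
# a ray into a single pixel color.
import torch

def composite_ray(sigmas, colors, deltas):
    """sigmas: (S,) densities, colors: (S, 3) RGB, deltas: (S,) segment lengths."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)            # per-sample opacity
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=0)    # running transmittance
    trans = torch.cat([torch.ones(1), trans[:-1]])        # shift so T_1 = 1
    weights = alphas * trans                              # contribution of each sample
    return (weights.unsqueeze(-1) * colors).sum(dim=0)    # (3,) composited pixel color
```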
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.