Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis
- URL: http://arxiv.org/abs/2304.12317v2
- Date: Mon, 2 Oct 2023 13:07:37 GMT
- Title: Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis
- Authors: Chonghyuk Song, Gengshan Yang, Kangle Deng, Jun-Yan Zhu, Deva Ramanan
- Abstract summary: We present Total-Recon, the first method to reconstruct deformable scenes from long monocular RGBD videos.
Our method hierarchically decomposes the scene into the background and objects, whose motion is decomposed into root-body motion and local articulations.
- Score: 76.72505510632904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore the task of embodied view synthesis from monocular videos of
deformable scenes. Given a minute-long RGBD video of people interacting with
their pets, we render the scene from novel camera trajectories derived from the
in-scene motion of actors: (1) egocentric cameras that simulate the point of
view of a target actor and (2) 3rd-person cameras that follow the actor.
Building such a system requires reconstructing the root-body and articulated
motion of every actor, as well as a scene representation that supports
free-viewpoint synthesis. Longer videos are more likely to capture the scene
from diverse viewpoints (which helps reconstruction) but are also more likely
to contain larger motions (which complicates reconstruction). To address these
challenges, we present Total-Recon, the first method to photorealistically
reconstruct deformable scenes from long monocular RGBD videos. Crucially, to
scale to long videos, our method hierarchically decomposes the scene into the
background and objects, whose motion is decomposed into carefully initialized
root-body motion and local articulations. To quantify such "in-the-wild"
reconstruction and view synthesis, we collect ground-truth data with a
specialized stereo RGBD capture rig for 11 challenging videos; on this
benchmark, our method significantly outperforms prior work. Our code, model, and data can be found at
https://andrewsonga.github.io/totalrecon .
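To make the hierarchical decomposition above concrete, the snippet below sketches how an object's motion can be factored into a rigid root-body transform plus per-point local articulations, and how an egocentric camera can be derived from the root pose. This is a minimal illustration, not the released Total-Recon code; the function names, the simplified per-point offsets standing in for a learned deformation field, and the fixed head offset are assumptions made for the example.

```python
# Minimal sketch (not the actual Total-Recon implementation) of factoring object
# motion into a rigid root-body transform G_t in SE(3) plus local articulations,
# and deriving an embodied (egocentric) camera from the actor's root pose.
import numpy as np

def se3(R, t):
    """Pack a rotation R (3x3) and translation t (3,) into a 4x4 SE(3) matrix."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def warp_points(X_canonical, G_root, local_offsets):
    """Map canonical object points to world space at one time step.

    X_canonical:   (N, 3) points in the object's canonical (rest) frame.
    G_root:        (4, 4) root-body transform at this time step.
    local_offsets: (N, 3) per-point articulation displacements (a stand-in for
                   the learned deformation field used by the real method).
    """
    X_deformed = X_canonical + local_offsets                      # local articulation
    X_h = np.concatenate([X_deformed, np.ones((len(X_deformed), 1))], axis=1)
    return (G_root @ X_h.T).T[:, :3]                              # rigid root-body motion

def egocentric_camera(G_root, head_offset=np.array([0.0, 0.1, 0.0])):
    """Derive a camera-to-world pose rigidly attached near the actor's root."""
    return G_root @ se3(np.eye(3), head_offset)

# Toy usage: one actor, one time step.
points = np.random.rand(100, 3)                        # canonical geometry
G_t = se3(np.eye(3), np.array([0.5, 0.0, 1.0]))        # root pose at time t
offsets = 0.01 * np.random.randn(100, 3)               # small articulations
world_points = warp_points(points, G_t, offsets)
cam_pose = egocentric_camera(G_t)
print(world_points.shape, cam_pose.shape)              # (100, 3) (4, 4)
```

A 3rd-person following camera could be obtained the same way, by composing the root pose with a larger fixed offset behind the actor instead of the head offset.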
Related papers
- Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis [43.02778060969546]
We propose a controllable monocular dynamic view synthesis pipeline.
Our model does not require depth as input, and does not explicitly model 3D scene geometry.
We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.
arXiv Detail & Related papers (2024-05-23T17:59:52Z) - DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos [21.93514516437402]
We present DreamScene4D, the first approach to generate 3D dynamic scenes of multiple objects from monocular videos via novel view synthesis.
Our key insight is a "decompose-recompose" approach that factorizes the video scene into the background and object tracks.
We show extensive results on challenging DAVIS, Kubric, and self-captured videos with quantitative comparisons and a user preference study.
arXiv Detail & Related papers (2024-05-03T17:55:34Z) - Replay: Multi-modal Multi-view Acted Videos for Casual Holography [76.49914880351167]
Replay is a collection of multi-view, multi-modal videos of humans interacting socially.
Overall, the dataset contains over 4000 minutes of footage and over 7 million timestamped high-resolution frames.
The Replay dataset has many potential applications, such as novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, human body and face analysis, and training generative models.
arXiv Detail & Related papers (2023-07-22T12:24:07Z) - RUST: Latent Neural Scene Representations from Unposed Imagery [21.433079925439234]
Inferring structure of 3D scenes from 2D observations is a fundamental challenge in computer vision.
Recently popularized approaches based on neural scene representations have achieved tremendous impact.
RUST (Really Unposed Scene representation Transformer) is a pose-free approach to novel view synthesis, trained on RGB images alone.
arXiv Detail & Related papers (2022-11-25T18:59:10Z) - State of the Art in Dense Monocular Non-Rigid 3D Reconstruction [100.9586977875698]
3D reconstruction of deformable (or non-rigid) scenes from a set of monocular 2D image observations is a long-standing and actively researched area of computer vision and graphics.
This survey focuses on state-of-the-art methods for dense non-rigid 3D reconstruction of various deformable objects and composite scenes from monocular videos or sets of monocular views.
arXiv Detail & Related papers (2022-10-27T17:59:53Z) - HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video [44.58519508310171]
We introduce a free-viewpoint rendering method -- HumanNeRF -- that works on a given monocular video of a human performing complex body motions.
Our method enables pausing the video at any frame and rendering the subject from arbitrary new camera viewpoints.
arXiv Detail & Related papers (2022-01-11T18:51:21Z) - Recognizing Scenes from Novel Viewpoints [99.90914180489456]
Humans can perceive scenes in 3D from a handful of 2D views. For AI agents, the ability to recognize a scene from any viewpoint given only a few images enables them to efficiently interact with the scene and its objects.
We propose a model which takes as input a few RGB images of a new scene and recognizes the scene from novel viewpoints by segmenting it into semantic categories.
arXiv Detail & Related papers (2021-12-02T18:59:40Z) - NeuralDiff: Segmenting 3D objects that move in egocentric videos [92.95176458079047]
We study the problem of decomposing the observed 3D scene into a static background and a dynamic foreground.
This task is reminiscent of the classic background subtraction problem, but is significantly harder because all parts of the scene, static and dynamic, generate a large apparent motion.
In particular, we consider egocentric videos and further separate the dynamic component into objects and the actor that observes and moves them.
arXiv Detail & Related papers (2021-10-19T12:51:35Z) - Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video [76.19076002661157]
Non-Rigid Neural Radiance Fields (NR-NeRF) is a reconstruction and novel view synthesis approach for general non-rigid dynamic scenes.
We show that even a single consumer-grade camera is sufficient to synthesize sophisticated renderings of a dynamic scene from novel virtual camera views.
arXiv Detail & Related papers (2020-12-22T18:46:12Z) - Associative3D: Volumetric Reconstruction from Sparse Views [17.5320459412718]
This paper studies the problem of 3D volumetric reconstruction from two views of a scene with an unknown camera.
We propose a new approach that estimates reconstructions as well as distributions over camera/object and camera/camera transformations.
We train and test our approach on a dataset of indoor scenes, and rigorously evaluate the merits of our joint reasoning approach.
arXiv Detail & Related papers (2020-07-27T17:58:53Z)