GenLayNeRF: Generalizable Layered Representations with 3D Model
Alignment for Multi-Human View Synthesis
- URL: http://arxiv.org/abs/2309.11627v1
- Date: Wed, 20 Sep 2023 20:37:31 GMT
- Title: GenLayNeRF: Generalizable Layered Representations with 3D Model
Alignment for Multi-Human View Synthesis
- Authors: Youssef Abdelkareem, Shady Shehata, Fakhri Karray
- Abstract summary: GenLayNeRF is a generalizable layered scene representation for free-viewpoint rendering of multiple human subjects.
We divide the scene into multi-human layers anchored by the 3D body meshes.
We extract point-wise image-aligned and human-anchored features, which are correlated and fused.
- Score: 1.6574413179773757
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Novel view synthesis (NVS) of multi-human scenes imposes challenges due to
the complex inter-human occlusions. Layered representations handle these
complexities by dividing the scene into multi-layered radiance fields; however,
they are mainly constrained to per-scene optimization, making them inefficient.
Generalizable human view synthesis methods combine the pre-fitted 3D human
meshes with image features to reach generalization, yet they are mainly
designed to operate on single-human scenes. Another drawback is the reliance on
multi-step optimization techniques for parametric pre-fitting of the 3D body
models, which suffer from misalignment with the images in sparse-view settings,
causing hallucinations in the synthesized views. In this work, we propose
GenLayNeRF, a generalizable layered scene representation for free-viewpoint
rendering of multiple human subjects that requires no per-scene optimization
and only very sparse input views. We divide the scene into multi-human layers
anchored by the 3D body meshes. We then ensure pixel-level alignment of the
body models with the input views through a novel end-to-end trainable module
that carries out iterative parametric correction coupled with multi-view
feature fusion to produce aligned 3D models. For NVS, we extract point-wise
image-aligned and human-anchored features, which are correlated and fused using
self-attention and cross-attention modules. We further augment the features with
low-level RGB values through an attention-based RGB fusion module. To evaluate our
approach, we construct two multi-human view synthesis datasets: DeepMultiSyn
and ZJU-MultiHuman. The results indicate that our proposed approach outperforms
generalizable and non-human per-scene NeRF methods while performing on par with
layered per-scene methods, without any test-time optimization.
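As a rough illustration of the fusion step described in the abstract, the sketch below shows how point-wise image-aligned and human-anchored features for a batch of query points could be correlated with cross-attention, refined with self-attention across source views, and augmented with low-level RGB values. This is a minimal sketch assuming a PyTorch implementation; the class name, feature dimensions, the view averaging, and the learned gate used as a stand-in for the attention-based RGB fusion module are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn


class PointFeatureFusion(nn.Module):
    """Hypothetical fusion of image-aligned and human-anchored point features."""

    def __init__(self, feat_dim: int = 64, num_heads: int = 4):
        super().__init__()
        # Cross-attention: image-aligned features attend to human-anchored features.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Self-attention across the source views of each query point.
        self.self_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Stand-in for the attention-based RGB fusion module: a learned gate.
        self.rgb_proj = nn.Linear(3, feat_dim)
        self.rgb_gate = nn.Sequential(nn.Linear(2 * feat_dim, 1), nn.Sigmoid())

    def forward(self, img_feats, human_feats, rgb):
        # img_feats:   (P, V, C) image-aligned features, P points over V source views
        # human_feats: (P, V, C) human-anchored features sampled from the body meshes
        # rgb:         (P, V, 3) low-level RGB values at the projected point locations
        fused, _ = self.cross_attn(query=img_feats, key=human_feats, value=human_feats)
        fused, _ = self.self_attn(fused, fused, fused)
        rgb_feat = self.rgb_proj(rgb)
        gate = self.rgb_gate(torch.cat([fused, rgb_feat], dim=-1))
        fused = gate * fused + (1.0 - gate) * rgb_feat
        # Average over views to get one conditioning feature per query point.
        return fused.mean(dim=1)  # (P, C)


# Toy usage: 1024 query points, 3 source views, 64-dim features.
fusion = PointFeatureFusion(feat_dim=64, num_heads=4)
out = fusion(torch.randn(1024, 3, 64), torch.randn(1024, 3, 64), torch.rand(1024, 3, 3))
print(out.shape)  # torch.Size([1024, 64])
```

In the full pipeline such a fused feature would condition the per-layer radiance field that predicts density and color; the sketch stops at producing the conditioning feature.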
Related papers
- WE-GS: An In-the-wild Efficient 3D Gaussian Representation for Unconstrained Photo Collections [8.261637198675151]
Novel View Synthesis (NVS) from unconstrained photo collections is challenging in computer graphics.
We propose an efficient point-based differentiable rendering framework for scene reconstruction from photo collections.
Our approach outperforms existing approaches in rendering quality for novel view and appearance synthesis, with fast convergence and high rendering speed.
arXiv Detail & Related papers (2024-06-04T15:17:37Z)
- FreeSplat: Generalizable 3D Gaussian Splatting Towards Free-View Synthesis of Indoor Scenes [50.534213038479926]
FreeSplat is capable of reconstructing geometrically consistent 3D scenes from long-sequence input for free-view synthesis.
We propose a simple but effective free-view training strategy that ensures robust view synthesis across a broader view range, regardless of the number of input views.
arXiv Detail & Related papers (2024-05-28T08:40:14Z)
- Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks [53.67497327319569]
We introduce a novel neural rendering technique to solve image-to-3D from a single view.
Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks.
Our experiments show the advantages of our proposed approach with consistent results and rapid generation.
arXiv Detail & Related papers (2023-12-24T08:42:37Z)
- GM-NeRF: Learning Generalizable Model-based Neural Radiance Fields from Multi-view Images [79.39247661907397]
We introduce an effective framework, Generalizable Model-based Neural Radiance Fields, to synthesize free-viewpoint images.
Specifically, we propose a geometry-guided attention mechanism to register the appearance code from multi-view 2D images to a geometry proxy.
arXiv Detail & Related papers (2023-03-24T03:32:02Z)
- Multi-Plane Neural Radiance Fields for Novel View Synthesis [5.478764356647437]
Novel view synthesis is a long-standing problem that revolves around rendering frames of scenes from novel camera viewpoints.
In this work, we examine the performance, generalization, and efficiency of single-view multi-plane neural radiance fields.
We propose a new multiplane NeRF architecture that accepts multiple views to improve the synthesis results and expand the viewing range.
arXiv Detail & Related papers (2023-03-03T06:32:55Z)
- High-fidelity 3D GAN Inversion by Pseudo-multi-view Optimization [51.878078860524795]
We present a high-fidelity 3D generative adversarial network (GAN) inversion framework that can synthesize photo-realistic novel views.
Our approach enables high-fidelity 3D rendering from a single image, which is promising for various applications of AI-generated 3D content.
arXiv Detail & Related papers (2022-11-28T18:59:52Z)
- Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering.
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
- Human View Synthesis using a Single Sparse RGB-D Input [16.764379184593256]
We present a novel view synthesis framework that generates realistic renders from unseen viewpoints of any human captured with a single sparse RGB-D sensor.
An enhancer network improves the overall fidelity, even in areas occluded in the original view, producing crisp renders with fine details.
arXiv Detail & Related papers (2021-12-27T20:13:53Z)
- DeepMultiCap: Performance Capture of Multiple Characters Using Sparse Multiview Cameras [63.186486240525554]
DeepMultiCap is a novel method for multi-person performance capture using sparse multi-view cameras.
Our method can capture time-varying surface details without the need for pre-scanned template models.
arXiv Detail & Related papers (2021-05-01T14:32:13Z)