You Only Train Once: Multi-Identity Free-Viewpoint Neural Human Rendering from Monocular Videos
- URL: http://arxiv.org/abs/2303.05835v1
- Date: Fri, 10 Mar 2023 10:23:17 GMT
- Title: You Only Train Once: Multi-Identity Free-Viewpoint Neural Human Rendering from Monocular Videos
- Authors: Jaehyeok Kim, Dongyoon Wee, Dan Xu
- Abstract summary: You Only Train Once (YOTO) is a dynamic human generation framework, which performs free-viewpoint rendering of different human identities with distinct motions.
In this paper, we propose a set of learnable identity codes to expand the capability of the framework for multi-identity free-viewpoint rendering.
YOTO achieves state-of-the-art performance on all evaluation metrics while offering significant benefits in training and inference efficiency as well as rendering quality.
- Score: 10.795522875068073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce You Only Train Once (YOTO), a dynamic human generation
framework that performs free-viewpoint rendering of different human identities
with distinct motions via only one-time training from monocular videos. Most
prior works for this task require individualized optimization for each input
video containing a distinct human identity, which demands significant time and
resources for deployment and thereby impedes the scalability and overall
application potential of such systems. In this
paper, we tackle this problem by proposing a set of learnable identity codes to
expand the capability of the framework for multi-identity free-viewpoint
rendering, and an effective pose-conditioned code query mechanism to finely
model the pose-dependent non-rigid motions. YOTO optimizes neural radiance
fields (NeRF) by utilizing the designed identity codes to condition the model
to learn various canonical T-pose appearances in a single shared volumetric
representation. Moreover, jointly learning multiple identities within a
unified model naturally enables flexible motion transfer with high-quality,
photo-realistic renderings for all learned appearances. This capability expands
its potential use in important applications, including Virtual Reality. We
present extensive experimental results on ZJU-MoCap and PeopleSnapshot to
clearly demonstrate the effectiveness of our proposed model. YOTO achieves
state-of-the-art performance on all evaluation metrics while offering
significant benefits in training and inference efficiency as well as rendering
quality. The code and model will be made publicly available soon.
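To make the conditioning idea in the abstract concrete, below is a minimal, hypothetical sketch of a single shared NeRF MLP conditioned on learnable per-identity codes. This is not the authors' implementation: the class name, code and hidden dimensions, and the omission of the pose-conditioned code query mechanism are all assumptions made only for illustration.

```python
# Illustrative sketch (not the authors' code) of conditioning one shared NeRF
# on learnable per-identity codes, in the spirit of the YOTO abstract.
import torch
import torch.nn as nn

class IdentityConditionedNeRF(nn.Module):
    """Hypothetical shared canonical field selected by a per-identity code."""
    def __init__(self, num_identities: int, code_dim: int = 64,
                 pos_dim: int = 63, hidden: int = 256):
        super().__init__()
        # One learnable code per training identity, optimized jointly with the field.
        self.identity_codes = nn.Embedding(num_identities, code_dim)
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB (3 channels) + density (1 channel)
        )

    def forward(self, x_canonical: torch.Tensor, identity_id: torch.Tensor):
        # x_canonical: (N, pos_dim) positionally encoded canonical-space samples.
        # identity_id: (N,) integer ids selecting which learned appearance to render.
        code = self.identity_codes(identity_id)                # (N, code_dim)
        out = self.mlp(torch.cat([x_canonical, code], dim=-1))
        rgb = torch.sigmoid(out[:, :3])
        sigma = torch.relu(out[:, 3:])
        return rgb, sigma

# Usage: one shared model; the identity is selected per query at render time.
model = IdentityConditionedNeRF(num_identities=8)
x = torch.randn(1024, 63)                       # assumed encoded sample points
ids = torch.full((1024,), 3, dtype=torch.long)  # render learned identity #3
rgb, sigma = model(x, ids)
```

Under this kind of design, supporting an additional identity amounts to adding one code vector, while the volumetric representation itself stays shared across all identities.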
Related papers
- Learned Single-Pass Multitasking Perceptual Graphics for Immersive Displays [11.15417027415116]
We propose a lightweight, text-guided, learned multitasking perceptual graphics model.
Our model supports a variety of perceptual tasks, including foveated rendering, dynamic range enhancement, image denoising, and chromostereopsis.
We evaluate our model's performance on embedded platforms and validate the perceptual quality of our model through a user study.
arXiv Detail & Related papers (2024-07-31T19:05:00Z)
- Motion-Oriented Compositional Neural Radiance Fields for Monocular Dynamic Human Modeling [10.914612535745789]
This paper introduces Motion-oriented Compositional Neural Radiance Fields (MoCo-NeRF), a framework designed to perform free-viewpoint rendering of monocular human videos.
arXiv Detail & Related papers (2024-07-16T17:59:01Z)
- VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation [79.99551055245071]
We propose VividPose, an end-to-end pipeline that ensures superior temporal stability.
An identity-aware appearance controller integrates additional facial information without compromising other appearance details.
A geometry-aware pose controller utilizes both dense rendering maps from SMPL-X and sparse skeleton maps.
VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset.
arXiv Detail & Related papers (2024-05-28T13:18:32Z)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X.
SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks.
We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z)
- Federated Multi-View Synthesizing for Metaverse [52.59476179535153]
The metaverse is expected to provide immersive entertainment, education, and business applications.
Virtual reality (VR) transmission over wireless networks is data- and computation-intensive.
We have developed a novel multi-view synthesizing framework that can efficiently provide synthesis, storage, and communication resources for wireless content delivery in the metaverse.
arXiv Detail & Related papers (2023-12-18T13:51:56Z)
- GHuNeRF: Generalizable Human NeRF from a Monocular Video [63.741714198481354]
GHuNeRF learns a generalizable human NeRF model from a monocular video.
We validate our approach on the widely-used ZJU-MoCap dataset.
arXiv Detail & Related papers (2023-08-31T09:19:06Z)
- MonoHuman: Animatable Human Neural Field from Monocular Video [30.113937856494726]
We propose a novel framework MonoHuman, which robustly renders view-consistent and high-fidelity avatars under arbitrary novel poses.
Our key insight is to model the deformation field with bi-directional constraints and explicitly leverage off-the-shelf information to reason about features for coherent results; a generic sketch of such a bi-directional consistency constraint follows this entry.
arXiv Detail & Related papers (2023-04-04T17:55:03Z)
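As referenced in the MonoHuman summary above, one common way to impose bi-directional constraints on a deformation field is a cycle-consistency loss between observation-to-canonical and canonical-to-observation warps. The sketch below is a generic, hypothetical illustration of that idea only; the network sizes, the SMPL-style 72-dimensional pose vector, and all names are assumptions, not the paper's implementation.

```python
# Illustrative sketch (not MonoHuman's code) of a bi-directional deformation
# consistency loss: points warped observation -> canonical and back should
# return to their starting positions.
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Hypothetical MLP predicting a per-point offset, conditioned on body pose."""
    def __init__(self, pose_dim: int = 72, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # predicted 3D offset
        )

    def forward(self, x: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        return x + self.net(torch.cat([x, pose], dim=-1))

backward_deform = DeformationField()  # observation space -> canonical space
forward_deform = DeformationField()   # canonical space -> observation space

def cycle_consistency_loss(x_obs: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
    # Warp each point to canonical space, warp it back, and penalize the drift.
    x_canonical = backward_deform(x_obs, pose)
    x_back = forward_deform(x_canonical, pose)
    return (x_back - x_obs).norm(dim=-1).mean()

x_obs = torch.randn(2048, 3)   # sampled points in observation space
pose = torch.randn(2048, 72)   # assumed SMPL-style pose parameters, repeated per point
loss = cycle_consistency_loss(x_obs, pose)
```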
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning; a generic sketch of a cross-modality contrastive objective follows this entry.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)
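As noted in the i-Code summary above, a cross-modality contrastive objective can be illustrated with a generic InfoNCE-style loss that pulls together paired embeddings from two modalities and pushes apart unpaired ones. This is only an illustrative sketch of that family of objectives, not i-Code's actual loss; the temperature value and embedding sizes are assumptions.

```python
# Generic InfoNCE-style cross-modality contrastive loss, shown only to
# illustrate the kind of objective mentioned in the i-Code summary.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(vision_emb: torch.Tensor,
                                 speech_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    # Embeddings of the same clip in two modalities form positive pairs;
    # every other pairing in the batch serves as a negative.
    v = F.normalize(vision_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    logits = v @ s.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # matches lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with randomly initialized embeddings (batch of 32, dimension 512 assumed).
loss = cross_modal_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```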
- Neural Human Performer: Learning Generalizable Radiance Fields for Human Performance Rendering [34.80975358673563]
We propose a novel approach that learns generalizable neural radiance fields based on a parametric human body model for robust performance capture.
Experiments on the ZJU-MoCap and AIST datasets show that our method significantly outperforms recent generalizable NeRF methods on unseen identities and poses.
arXiv Detail & Related papers (2021-09-15T17:32:46Z)
- Multimodal Face Synthesis from Visual Attributes [85.87796260802223]
We propose a novel generative adversarial network that simultaneously synthesizes identity preserving multimodal face images.
Multimodal stretch-in modules are introduced in the discriminator, which discriminates between real and fake images.
arXiv Detail & Related papers (2021-04-09T13:47:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.