HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion
- URL: http://arxiv.org/abs/2310.08579v2
- Date: Fri, 15 Mar 2024 02:02:21 GMT
- Title: HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion
- Authors: Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Yanyu Li, Dahua Lin, Xihui Liu, Ziwei Liu, Sergey Tulyakov
- Abstract summary: We propose a unified framework, HyperHuman, that generates in-the-wild human images of high realism and diverse layouts.
Our model enforces the joint learning of image appearance, spatial relationship, and geometry in a unified network.
Our framework yields state-of-the-art performance, generating hyper-realistic human images under diverse scenarios.
- Score: 114.15397904945185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite significant advances in large-scale text-to-image models, achieving hyper-realistic human image generation remains a desirable yet unsolved task. Existing models like Stable Diffusion and DALL-E 2 tend to generate human images with incoherent parts or unnatural poses. To tackle these challenges, our key insight is that human images are inherently structural over multiple granularities, from the coarse-level body skeleton to fine-grained spatial geometry. Therefore, capturing such correlations between the explicit appearance and the latent structure in one model is essential for generating coherent and natural human images. To this end, we propose a unified framework, HyperHuman, that generates in-the-wild human images of high realism and diverse layouts. Specifically, 1) we first build a large-scale human-centric dataset, named HumanVerse, which consists of 340M images with comprehensive annotations such as human pose, depth, and surface normal. 2) Next, we propose a Latent Structural Diffusion Model that simultaneously denoises the depth and surface normal along with the synthesized RGB image. Our model enforces the joint learning of image appearance, spatial relationship, and geometry in a unified network, where each branch complements the others with both structural awareness and textural richness. 3) Finally, to further boost the visual quality, we propose a Structure-Guided Refiner that composes the predicted conditions for more detailed generation at higher resolution. Extensive experiments demonstrate that our framework yields state-of-the-art performance, generating hyper-realistic human images under diverse scenarios. Project Page: https://snap-research.github.io/HyperHuman/
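To make point 2) above concrete, here is a minimal, self-contained PyTorch sketch of the joint-denoising idea: one network predicts the noise for RGB, depth, and surface-normal latents together, so each modality conditions the others. This is not the paper's released code; the convolutional stack, channel counts, and the shared scalar noise level `alpha_bar` are simplifying assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointStructuralDenoiser(nn.Module):
    """Toy stand-in for a latent structural diffusion model: a single
    network denoises concatenated RGB, depth, and normal latents so that
    appearance and geometry are learned jointly. The conv stack and
    channel sizes are illustrative, not HyperHuman's architecture."""

    def __init__(self, rgb_ch=4, depth_ch=1, normal_ch=3, hidden=64):
        super().__init__()
        in_ch = rgb_ch + depth_ch + normal_ch
        self.splits = (rgb_ch, depth_ch, normal_ch)
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, in_ch, 3, padding=1),  # noise for all three branches
        )

    def forward(self, z_rgb, z_depth, z_normal):
        # Concatenating along channels lets each branch see the others,
        # coupling textural appearance with spatial structure.
        eps = self.net(torch.cat([z_rgb, z_depth, z_normal], dim=1))
        return torch.split(eps, self.splits, dim=1)

def joint_denoising_loss(model, z0, alpha_bar):
    """One simplified DDPM-style step: perturb the RGB/depth/normal
    latents with a shared noise level and regress the noise per branch."""
    eps = [torch.randn_like(z) for z in z0]
    z_t = [alpha_bar.sqrt() * z + (1.0 - alpha_bar).sqrt() * e
           for z, e in zip(z0, eps)]
    preds = model(*z_t)
    return sum(F.mse_loss(p, e) for p, e in zip(preds, eps))

# Usage on random latents (stand-ins for VAE-encoded image/depth/normal):
model = JointStructuralDenoiser()
z0 = [torch.randn(2, c, 32, 32) for c in (4, 1, 3)]
loss = joint_denoising_loss(model, z0, alpha_bar=torch.tensor(0.7))
loss.backward()
```

The design choice mirrored here is the joint objective: gradients from the depth and normal branches flow into the same backbone that denoises RGB, which is what enforces structure-aware appearance in a single network.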
Related papers
- Single Image, Any Face: Generalisable 3D Face Generation [59.9369171926757]
We propose a novel model, Gen3D-Face, which generates 3D human faces from a single unconstrained input image.
To the best of our knowledge, this is the first attempt and benchmark for creating photorealistic 3D human face avatars from single images.
arXiv Detail & Related papers (2024-09-25T14:56:37Z)
- PSHuman: Photorealistic Single-view Human Reconstruction using Cross-Scale Diffusion [43.850899288337025]
PSHuman is a novel framework that explicitly reconstructs human meshes using priors from a multiview diffusion model.
It is found that directly applying multiview diffusion to single-view human images leads to severe geometric distortions.
To enhance cross-view body shape consistency of varied human poses, we condition the generative model on parametric models like SMPL-X.
arXiv Detail & Related papers (2024-09-16T10:13:06Z)
- HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors [47.62426718293504]
HumanSplat predicts the 3D Gaussian Splatting properties of any human from a single input image.
HumanSplat surpasses existing state-of-the-art methods in photorealistic novel-view synthesis.
arXiv Detail & Related papers (2024-06-18T10:05:33Z)
- StructLDM: Structured Latent Diffusion for 3D Human Generation [45.51684124904457]
StructLDM is a diffusion-based unconditional 3D human generative model learned from 2D images.
It enables controllable 3D human generation and editing at different levels, including pose/view/shape control and higher-level tasks such as compositional generation, part-aware clothing editing, and 3D virtual try-on.
arXiv Detail & Related papers (2024-04-01T17:00:18Z)
- Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation [24.49857926071974]
Vanilla text-to-image diffusion models struggle with generating accurate human images.
Existing methods address this issue mostly by fine-tuning the model with extra images or adding additional controls.
This paper explores the integration of human-centric priors directly into the model fine-tuning stage.
arXiv Detail & Related papers (2024-03-08T11:59:32Z)
- InceptionHuman: Controllable Prompt-to-NeRF for Photorealistic 3D Human Generation [61.62346472443454]
InceptionHuman is a prompt-to-NeRF framework that allows easy control via a combination of prompts in different modalities to generate photorealistic 3D humans.
InceptionHuman achieves consistent 3D human generation within a progressively refined NeRF space.
arXiv Detail & Related papers (2023-11-27T15:49:41Z)
- Pose Guided Human Image Synthesis with Partially Decoupled GAN [25.800174118151638]
Pose Guided Human Image Synthesis (PGHIS) is the challenging task of transforming a human image from a reference pose to a target pose.
We propose a method that decouples the human body into several parts to guide the synthesis of a realistic image of the person.
In addition, we design a multi-head attention-based module for PGHIS.
arXiv Detail & Related papers (2022-10-07T15:31:37Z)
- Generalizable Neural Performer: Learning Robust Radiance Fields for Human Novel View Synthesis [52.720314035084215]
This work targets synthesizing free-viewpoint images of arbitrary human performers with a general deep learning framework.
We present a simple yet powerful framework, named Generalizable Neural Performer (GNR), that learns a generalizable and robust neural body representation.
Experiments on GeneBody-1.0 and ZJU-Mocap show that our method is more robust than recent state-of-the-art generalizable methods.
arXiv Detail & Related papers (2022-04-25T17:14:22Z)
- SparseFusion: Dynamic Human Avatar Modeling from Sparse RGBD Images [49.52782544649703]
We propose a novel approach to reconstruct 3D human body shapes based on a sparse set of RGBD frames.
The main challenge is how to robustly fuse these sparse frames into a canonical 3D model.
Our framework is flexible, with potential applications going beyond shape reconstruction.
arXiv Detail & Related papers (2020-06-05T18:53:36Z)