Towards Effective Usage of Human-Centric Priors in Diffusion Models for
Text-based Human Image Generation
- URL: http://arxiv.org/abs/2403.05239v1
- Date: Fri, 8 Mar 2024 11:59:32 GMT
- Title: Towards Effective Usage of Human-Centric Priors in Diffusion Models for
Text-based Human Image Generation
- Authors: Junyan Wang, Zhenhong Sun, Zhiyu Tan, Xuanbai Chen, Weihua Chen, Hao
Li, Cheng Zhang, Yang Song
- Abstract summary: Vanilla text-to-image diffusion models struggle with generating accurate human images.
Existing methods address this issue mostly by fine-tuning the model with extra images or adding additional controls.
This paper explores the integration of human-centric priors directly into the model fine-tuning stage.
- Score: 24.49857926071974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vanilla text-to-image diffusion models struggle with generating accurate
human images, commonly resulting in imperfect anatomies such as unnatural
postures or disproportionate limbs.Existing methods address this issue mostly
by fine-tuning the model with extra images or adding additional controls --
human-centric priors such as pose or depth maps -- during the image generation
phase. This paper explores the integration of these human-centric priors
directly into the model fine-tuning stage, essentially eliminating the need for
extra conditions at the inference stage. We realize this idea by proposing a
human-centric alignment loss to strengthen human-related information from the
textual prompts within the cross-attention maps. To ensure semantic detail
richness and human structural accuracy during fine-tuning, we introduce
scale-aware and step-wise constraints within the diffusion process, according
to an in-depth analysis of the cross-attention layer. Extensive experiments
show that our method largely improves over state-of-the-art text-to-image
models to synthesize high-quality human images based on user-written prompts.
Project page: \url{https://hcplayercvpr2024.github.io}.
Related papers
- MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts [61.274246025372044]
We study human-centric text-to-image generation in context of faces and hands.
We propose a method called Mixture of Low-rank Experts (MoLE) by considering low-rank modules trained on close-up hand and face images respectively as experts.
This concept draws inspiration from our observation of low-rank refinement, where a low-rank module trained by a customized close-up dataset has the potential to enhance the corresponding image part when applied at an appropriate scale.
arXiv Detail & Related papers (2024-10-30T17:59:57Z) - Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts, to steer diffusion models generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z) - PSHuman: Photorealistic Single-view Human Reconstruction using Cross-Scale Diffusion [43.850899288337025]
PSHuman is a novel framework that explicitly reconstructs human meshes utilizing priors from the multiview diffusion model.
It is found that directly applying multiview diffusion on single-view human images leads to severe geometric distortions.
To enhance cross-view body shape consistency of varied human poses, we condition the generative model on parametric models like SMPL-X.
arXiv Detail & Related papers (2024-09-16T10:13:06Z) - HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance [80.97360194728705]
AbHuman is the first large-scale synthesized human benchmark focusing on anatomical anomalies.
HumanRefiner is a novel plug-and-play approach for the coarse-to-fine refinement of human anomalies in text-to-image generation.
arXiv Detail & Related papers (2024-07-09T15:14:41Z) - Boost Your Own Human Image Generation Model via Direct Preference Optimization with AI Feedback [5.9726297901501475]
We introduce a novel approach tailored specifically for human image generation utilizing Direct Preference Optimization (DPO)
Specifically, we introduce an efficient method for constructing a specialized DPO dataset for training human image generation models without the need for costly human feedback.
Our method demonstrates its versatility and effectiveness in generating human images, including personalized text-to-image generation.
arXiv Detail & Related papers (2024-05-30T16:18:05Z) - Improving face generation quality and prompt following with synthetic captions [57.47448046728439]
We introduce a training-free pipeline designed to generate accurate appearance descriptions from images of people.
We then use these synthetic captions to fine-tune a text-to-image diffusion model.
Our results demonstrate that this approach significantly improves the model's ability to generate high-quality, realistic human faces.
arXiv Detail & Related papers (2024-05-17T15:50:53Z) - HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion [114.15397904945185]
We propose a unified framework, HyperHuman, that generates in-the-wild human images of high realism and diverse layouts.
Our model enforces the joint learning of image appearance, spatial relationship, and geometry in a unified network.
Our framework yields the state-of-the-art performance, generating hyper-realistic human images under diverse scenarios.
arXiv Detail & Related papers (2023-10-12T17:59:34Z) - HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for
Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z) - Pose Guided Human Image Synthesis with Partially Decoupled GAN [25.800174118151638]
Pose Guided Human Image Synthesis (PGHIS) is a challenging task of transforming a human image from the reference pose to a target pose.
We propose a method by decoupling the human body into several parts to guide the synthesis of a realistic image of the person.
In addition, we design a multi-head attention-based module for PGHIS.
arXiv Detail & Related papers (2022-10-07T15:31:37Z) - Structure-aware Person Image Generation with Pose Decomposition and
Semantic Correlation [29.727033198797518]
We propose a structure-aware flow based method for high-quality person image generation.
We decompose the human body into different semantic parts and apply different networks to predict the flow fields for these parts separately.
Our method can generate high-quality results under large pose discrepancy and outperforms state-of-the-art methods in both qualitative and quantitative comparisons.
arXiv Detail & Related papers (2021-02-05T03:07:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.