MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts
- URL: http://arxiv.org/abs/2410.23332v1
- Date: Wed, 30 Oct 2024 17:59:57 GMT
- Title: MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts
- Authors: Jie Zhu, Yixiong Chen, Mingyu Ding, Ping Luo, Leye Wang, Jingdong Wang,
- Abstract summary: We study human-centric text-to-image generation in context of faces and hands.
We propose a method called Mixture of Low-rank Experts (MoLE) by considering low-rank modules trained on close-up hand and face images respectively as experts.
This concept draws inspiration from our observation of low-rank refinement, where a low-rank module trained by a customized close-up dataset has the potential to enhance the corresponding image part when applied at an appropriate scale.
- Score: 61.274246025372044
- License:
- Abstract: Text-to-image diffusion has attracted vast attention due to its impressive image-generation capabilities. However, when it comes to human-centric text-to-image generation, particularly in the context of faces and hands, the results often fall short of naturalness due to insufficient training priors. We alleviate the issue in this work from two perspectives. 1) From the data aspect, we carefully collect a human-centric dataset comprising over one million high-quality human-in-the-scene images and two specific sets of close-up images of faces and hands. These datasets collectively provide a rich prior knowledge base to enhance the human-centric image generation capabilities of the diffusion model. 2) On the methodological front, we propose a simple yet effective method called Mixture of Low-rank Experts (MoLE) by considering low-rank modules trained on close-up hand and face images respectively as experts. This concept draws inspiration from our observation of low-rank refinement, where a low-rank module trained by a customized close-up dataset has the potential to enhance the corresponding image part when applied at an appropriate scale. To validate the superiority of MoLE in the context of human-centric image generation compared to state-of-the-art, we construct two benchmarks and perform evaluations with diverse metrics and human studies. Datasets, model, and code are released at https://sites.google.com/view/mole4diffuser/.
Related papers
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts, to steer diffusion models generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z) - Boost Your Own Human Image Generation Model via Direct Preference Optimization with AI Feedback [5.9726297901501475]
We introduce a novel approach tailored specifically for human image generation utilizing Direct Preference Optimization (DPO)
Specifically, we introduce an efficient method for constructing a specialized DPO dataset for training human image generation models without the need for costly human feedback.
Our method demonstrates its versatility and effectiveness in generating human images, including personalized text-to-image generation.
arXiv Detail & Related papers (2024-05-30T16:18:05Z) - CosmicMan: A Text-to-Image Foundation Model for Humans [30.155677646188572]
We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images.
CosmicMan enables generating photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions.
arXiv Detail & Related papers (2024-04-01T17:59:05Z) - Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On [29.217423805933727]
Diffusion model-based approaches have recently become popular, as they are excellent at image synthesis tasks.
We propose an Texture-Preserving Diffusion (TPD) model for virtual try-on, which enhances the fidelity of the results.
Second, we propose a novel diffusion-based method that predicts a precise inpainting mask based on the person and reference garment images.
arXiv Detail & Related papers (2024-04-01T12:43:22Z) - Towards Effective Usage of Human-Centric Priors in Diffusion Models for
Text-based Human Image Generation [24.49857926071974]
Vanilla text-to-image diffusion models struggle with generating accurate human images.
Existing methods address this issue mostly by fine-tuning the model with extra images or adding additional controls.
This paper explores the integration of human-centric priors directly into the model fine-tuning stage.
arXiv Detail & Related papers (2024-03-08T11:59:32Z) - Stellar: Systematic Evaluation of Human-Centric Personalized
Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that input a single image of an individual and ground the generation process along with text describing the desired visual context.
We introduce a standardized dataset (Stellar) that contains personalized prompts coupled with images of individuals that is an order of magnitude larger than existing relevant datasets and where rich semantic ground-truth annotations are readily available.
We derive a simple yet efficient, personalized text-to-image baseline that does not require test-time fine-tuning for each subject and which sets quantitatively and in human trials a new SoTA.
arXiv Detail & Related papers (2023-12-11T04:47:39Z) - HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion [114.15397904945185]
We propose a unified framework, HyperHuman, that generates in-the-wild human images of high realism and diverse layouts.
Our model enforces the joint learning of image appearance, spatial relationship, and geometry in a unified network.
Our framework yields the state-of-the-art performance, generating hyper-realistic human images under diverse scenarios.
arXiv Detail & Related papers (2023-10-12T17:59:34Z) - HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for
Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z) - Bridging Composite and Real: Towards End-to-end Deep Image Matting [88.79857806542006]
We study the roles of semantics and details for image matting.
We propose a novel Glance and Focus Matting network (GFM), which employs a shared encoder and two separate decoders.
Comprehensive empirical studies have demonstrated that GFM outperforms state-of-the-art methods.
arXiv Detail & Related papers (2020-10-30T10:57:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.