From Text to Pose to Image: Improving Diffusion Model Control and Quality
- URL: http://arxiv.org/abs/2411.12872v2
- Date: Fri, 22 Nov 2024 10:26:27 GMT
- Title: From Text to Pose to Image: Improving Diffusion Model Control and Quality
- Authors: Clément Bonnet, Ariel N. Lee, Franck Wertel, Antoine Tamano, Tanguy Cizain, Pablo Ducru
- Abstract summary: We introduce a text-to-pose (T2P) generative model alongside a new sampling algorithm, and a new pose adapter that incorporates more pose keypoints for higher pose fidelity.
Together, these two new state-of-the-art models enable, for the first time, a generative text-to-pose-to-image framework for higher pose control in diffusion models.
- Score: 0.5183511047901651
- Abstract: In the last two years, text-to-image diffusion models have become extremely popular. As their quality and usage increase, a major concern has been the need for better output control. In addition to prompt engineering, one effective method to improve the controllability of diffusion models has been to condition them on additional modalities such as image style, depth map, or keypoints. This forms the basis of ControlNets or Adapters. When attempting to apply these methods to control human poses in outputs of text-to-image diffusion models, two main challenges have arisen. The first challenge is generating poses following a wide range of semantic text descriptions, for which previous methods involved searching for a pose within a dataset of (caption, pose) pairs. The second challenge is conditioning image generation on a specified pose while keeping both high aesthetic and high pose fidelity. In this article, we fix these two main issues by introducing a text-to-pose (T2P) generative model alongside a new sampling algorithm, and a new pose adapter that incorporates more pose keypoints for higher pose fidelity. Together, these two new state-of-the-art models enable, for the first time, a generative text-to-pose-to-image framework for higher pose control in diffusion models. We release all models and the code used for the experiments at https://github.com/clement-bonnet/text-to-pose.
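As an illustration of the pose-conditioning mechanism this framework builds on, the sketch below generates an image conditioned on a pose skeleton with a publicly available OpenPose ControlNet via the diffusers library. It is only an approximation of the idea: the checkpoint names and the pose-image path are assumptions for illustration, and the paper's own T2P model and higher-keypoint pose adapter are not used here.
```python
# Minimal sketch of pose-conditioned image generation with a public OpenPose
# ControlNet (illustrates the general conditioning idea, not the paper's own
# T2P model or higher-fidelity pose adapter).
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Public checkpoints, used purely for illustration.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# A rendered skeleton image (hypothetical path). In the paper's framework this
# pose would instead be sampled from the text-to-pose (T2P) model.
pose_image = Image.open("pose_skeleton.png")

image = pipe(
    "a dancer leaping across a stage, studio lighting",
    image=pose_image,
    num_inference_steps=30,
).images[0]
image.save("output.png")
```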
Related papers
- GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation [7.0646249774097525]
We propose PoseDiffusion, a framework with GUNet as its main model.
It is the first generative framework based on a diffusion model and also contains a series of variants fine-tuned from the Stable Diffusion model.
Results show that PoseDiffusion outperforms existing SoTA algorithms in terms of stability and diversity of text-driven pose skeleton generation.
arXiv Detail & Related papers (2024-09-18T04:05:59Z)
- Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation [29.79050316749927]
We introduce a novel approach to pose-conditioned human image generation, dividing the process into two stages: hand generation and subsequent body outpainting around the hands.
A novel blending technique, which combines the results of both stages in a coherent way, is introduced to preserve the hand details during the second stage.
Our approach not only enhances the quality of the generated hands but also offers improved control over hand pose, advancing the capabilities of pose-conditioned human image generation.
arXiv Detail & Related papers (2024-03-15T23:31:41Z)
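The blending step in the two-stage approach above is only named, not specified, in this summary; a plausible and purely illustrative realization is a feathered-mask composite of the hand-stage output into the outpainted body image. All function and variable names below are hypothetical.
```python
# Hypothetical sketch of blending a hand-stage render into the body-outpainted
# image with a feathered mask; the actual technique in the paper may differ.
import numpy as np
from PIL import Image, ImageFilter

def blend_hands(hand_stage: Image.Image, body_stage: Image.Image,
                hand_mask: Image.Image, feather_px: int = 8) -> Image.Image:
    """Composite hand pixels over the body image using a soft (feathered) mask."""
    mask = hand_mask.convert("L").filter(ImageFilter.GaussianBlur(feather_px))
    alpha = np.asarray(mask, dtype=np.float32)[..., None] / 255.0
    hands = np.asarray(hand_stage, dtype=np.float32)
    body = np.asarray(body_stage, dtype=np.float32)
    out = alpha * hands + (1.0 - alpha) * body
    return Image.fromarray(out.astype(np.uint8))
```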
- Direct Consistency Optimization for Compositional Text-to-Image Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
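A rough sketch of the stated objective (fit the reference images while penalizing deviation from the pretrained model) is given below. It is an illustrative interpretation, not the paper's exact loss, and `lambda_dev` is a made-up hyperparameter name.
```python
# Illustrative sketch: standard denoising loss on reference images plus a
# penalty keeping the fine-tuned UNet close to the frozen pretrained UNet.
# Not the paper's exact objective.
import torch
import torch.nn.functional as F

def consistency_finetune_loss(unet, unet_pretrained, noisy_latents, timesteps,
                              text_embeds, noise, lambda_dev=0.5):
    pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeds).sample
    with torch.no_grad():
        pred_ref = unet_pretrained(
            noisy_latents, timesteps, encoder_hidden_states=text_embeds
        ).sample
    denoise_loss = F.mse_loss(pred, noise)        # fit the reference images
    deviation_loss = F.mse_loss(pred, pred_ref)   # stay close to the pretrained model
    return denoise_loss + lambda_dev * deviation_loss
```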
- SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing [59.3017821001455]
SingleInsert is an image-to-text (I2T) inversion method that works from a single source image containing the intended concept.
In this work, we propose SingleInsert as a simple and effective baseline for single-image I2T inversion.
With the proposed techniques, SingleInsert excels in single concept generation with high visual fidelity while allowing flexible editing.
arXiv Detail & Related papers (2023-10-12T07:40:39Z)
- Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion [50.59261592343479]
We present Kandinsky, a novel exploration of latent diffusion architecture.
An image prior model is trained separately to map text embeddings to CLIP image embeddings.
We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting.
arXiv Detail & Related papers (2023-10-05T12:29:41Z)
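The two-stage design described above (a prior mapping text embeddings to CLIP image embeddings, followed by a latent-diffusion decoder) can be exercised through the diffusers library roughly as follows; the community checkpoint names are given only for illustration.
```python
# Sketch of a two-stage Kandinsky text-to-image flow via diffusers: a prior maps
# the text embedding to a CLIP image embedding, which conditions the decoder.
import torch
from diffusers import KandinskyPriorPipeline, KandinskyPipeline

prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
).to("cuda")
decoder = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16
).to("cuda")

prompt = "a watercolor painting of a lighthouse at dawn"
prior_out = prior(prompt)  # text embedding -> CLIP image embedding
image = decoder(
    prompt,
    image_embeds=prior_out.image_embeds,
    negative_image_embeds=prior_out.negative_image_embeds,
    num_inference_steps=50,
).images[0]
```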
- UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer [15.15576618501609]
Text-to-image (T2I) models have been used to generate high-quality images of people.
However, due to the random nature of the generation process, the person's appearance differs from one generation to the next.
We propose a multimodal diffusion model that accepts text, pose, and visual prompting.
arXiv Detail & Related papers (2023-04-18T10:05:37Z)
- Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos [107.65147103102662]
In this work, we utilize two datasets (image-pose pairs and pose-free videos) together with a pre-trained text-to-image (T2I) model to obtain pose-controllable character videos.
Specifically, in the first stage, only the keypoint-image pairs are used for controllable text-to-image generation.
In the second stage, we fine-tune the motion of the above network on a pose-free video dataset by adding learnable temporal self-attention and reformed cross-frame self-attention blocks.
arXiv Detail & Related papers (2023-04-03T17:55:14Z)
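A generic PyTorch sketch of the temporal self-attention idea mentioned above is shown below: attention is computed across the frame axis at each spatial token. It is not the paper's exact block (which also reforms cross-frame attention); the module name and shapes are assumptions.
```python
# Generic sketch of a temporal self-attention block for adapting a T2I model to
# video: attention mixes information across frames per spatial token.
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, channels)
        b, f, n, c = x.shape
        h = x.permute(0, 2, 1, 3).reshape(b * n, f, c)  # attend over frames per token
        h = self.norm(h)
        out, _ = self.attn(h, h, h)
        out = out.reshape(b, n, f, c).permute(0, 2, 1, 3)
        return x + out  # residual keeps the pretrained T2I behavior as a fallback
```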
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different synthesis stages.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
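The ensemble idea above amounts to routing each denoising step to a denoiser specialized for that noise range; a toy sketch of such routing follows, with interval boundaries and names invented for illustration.
```python
# Toy sketch of routing denoising steps to specialized expert denoisers by
# noise level, as in an ensemble-of-experts setup. Names and intervals are made up.
from typing import Callable, List, Tuple

import torch

Denoiser = Callable[[torch.Tensor, int, torch.Tensor], torch.Tensor]

def make_router(experts: List[Tuple[range, Denoiser]]) -> Denoiser:
    """Return a denoiser that dispatches each timestep to the expert covering it."""
    def routed(x: torch.Tensor, t: int, cond: torch.Tensor) -> torch.Tensor:
        for interval, expert in experts:
            if t in interval:
                return expert(x, t, cond)
        raise ValueError(f"no expert covers timestep {t}")
    return routed

# e.g. one expert for high-noise (text-heavy) steps and one for low-noise steps:
# router = make_router([(range(500, 1000), high_noise_expert),
#                       (range(0, 500), low_noise_expert)])
```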
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training a text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can be beneficial to a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
- Implementing and Experimenting with Diffusion Models for Text-to-Image Generation [0.0]
Two models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images could be generated from a simple textual description of an image.
Text-to-image models require exceptionally large amounts of computational resources to train, as well as the handling of huge datasets collected from the internet.
This thesis contributes by reviewing the different approaches and techniques used by these models, and then by proposing our own implementation of a text-to-image model.
arXiv Detail & Related papers (2022-09-22T12:03:33Z)