Pose Guided Multi-person Image Generation From Text
- URL: http://arxiv.org/abs/2203.04907v1
- Date: Wed, 9 Mar 2022 17:38:03 GMT
- Title: Pose Guided Multi-person Image Generation From Text
- Authors: Soon Yau Cheong, Armin Mustafa, Andrew Gilbert
- Abstract summary: Existing methods struggle to create high-fidelity full-body images, especially images of multiple people.
We propose a pose-guided text-to-image model, using pose as an additional input constraint.
We show results on the DeepFashion dataset and create a new multi-person DeepFashion dataset to demonstrate the multi-person capabilities of our approach.
- Score: 15.15576618501609
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have recently been shown to generate high-quality images from texts. However, existing methods struggle to create high-fidelity full-body images, especially images containing multiple people. A person's pose has a high degree of freedom that is difficult to describe using words alone; this creates errors in the generated image, such as incorrect body proportions and pose. We propose a pose-guided text-to-image model that uses pose as an additional input constraint. Using the proposed Keypoint Pose Encoding (KPE) to encode human pose into a low-dimensional representation, our model can generate novel multi-person images that accurately represent the provided pose and text descriptions, with minimal errors. We demonstrate that KPE is invariant to changes in the target image domain and image resolution; we show results on the DeepFashion dataset and create a new multi-person DeepFashion dataset to demonstrate the multi-person capabilities of our approach.
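The abstract does not detail how KPE works internally; the minimal sketch below illustrates only the claimed resolution-invariance, assuming COCO-style 17-keypoint input and a hypothetical encode_keypoints helper that normalizes coordinates by image size before quantizing them onto a coarse grid of discrete tokens.

```python
# Hypothetical sketch of a resolution-invariant keypoint encoding.
# Not the paper's actual KPE; grid size and keypoint count are assumptions.
import numpy as np

NUM_KEYPOINTS = 17   # COCO-style body keypoints (assumption)
GRID = 32            # coarse quantization grid (assumption)

def encode_keypoints(kpts_xy, img_w, img_h):
    """Map (N, 17, 2) pixel keypoints to (N, 17) discrete grid tokens."""
    norm = kpts_xy / np.array([img_w, img_h], dtype=float)   # -> [0, 1]
    cells = np.clip((norm * GRID).astype(int), 0, GRID - 1)
    return cells[..., 1] * GRID + cells[..., 0]              # row-major cell index

# The same pose rendered at two resolutions yields the same tokens.
pose = np.random.rand(1, NUM_KEYPOINTS, 2)
lo = encode_keypoints(pose * [320, 480], 320, 480)
hi = encode_keypoints(pose * [640, 960], 640, 960)
print(np.array_equal(lo, hi))   # True: encoding is independent of resolution
```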
Related papers
- PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation [38.958695275774616]
We introduce a new transformer-based model, trained in a retrieval fashion, that can take as input any combination of the aforementioned modalities (3D pose, image, and text).
We showcase the potential of such an embroidered pose representation for (1) SMPL regression from an image with an optional text cue, and (2) the task of fine-grained instruction generation.
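The abstract does not specify the fusion architecture; the sketch below illustrates only the general idea of fusing whatever modality tokens are present through a shared transformer (all names and sizes are illustrative assumptions, not PoseEmbroider's actual code).

```python
# Illustrative sketch: fuse any available subset of modality tokens
# (image, text, 3D pose) into one embedding via a shared transformer.
import torch, torch.nn as nn

class AnyModalityFuser(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens_by_modality):
        # tokens_by_modality: list of (B, T_i, dim) tensors; pass only
        # the modalities that are present for this batch.
        batch = tokens_by_modality[0].shape[0]
        x = torch.cat([self.cls.expand(batch, -1, -1), *tokens_by_modality], dim=1)
        return self.fuser(x)[:, 0]      # fused embedding from the CLS slot

model = AnyModalityFuser()
img_tokens, text_tokens = torch.randn(2, 10, 256), torch.randn(2, 5, 256)
emb = model([img_tokens, text_tokens])  # pose tokens omitted -> still works
print(emb.shape)                        # torch.Size([2, 256])
```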
arXiv Detail & Related papers (2024-09-10T14:09:39Z)
- Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS).
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
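As a loose illustration of the learnable-query idea (not CFLD's actual decoder), the sketch below shows a small module whose query vectors repeatedly cross-attend to person-image features to produce a coarse prompt-like summary; layer counts and sizes are assumptions.

```python
# Hedged sketch of progressively refined learnable queries.
import torch, torch.nn as nn

class QueryRefiner(nn.Module):
    def __init__(self, num_queries=16, dim=256, layers=3):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            for _ in range(layers))

    def forward(self, img_feats):                 # img_feats: (B, HW, dim)
        q = self.queries.expand(img_feats.shape[0], -1, -1)
        for attn in self.attn:                    # progressive refinement
            q = q + attn(q, img_feats, img_feats)[0]
        return q                                  # (B, num_queries, dim)

prompt = QueryRefiner()(torch.randn(2, 64, 256))
print(prompt.shape)   # torch.Size([2, 16, 256])
```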
arXiv Detail & Related papers (2024-02-28T06:07:07Z)
- Text-Conditional Contextualized Avatars For Zero-Shot Personalization [47.85747039373798]
We propose a pipeline that enables personalization of image generation with avatars capturing a user's identity in a delightful way.
Our pipeline is zero-shot, avatar texture and style agnostic, and does not require training on the avatar at all.
We show, for the first time, how to leverage large-scale image datasets to learn human 3D pose parameters.
arXiv Detail & Related papers (2023-04-14T22:00:44Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
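A minimal sketch of one plausible reading of a unified multi-modal latent space follows: text tokens and a subject-image embedding are projected into one conditioning sequence that a diffusion U-Net could cross-attend to. Dimensions and projections here are assumptions, not UMM-Diffusion's actual design.

```python
# Sketch: project text and image into one latent space and concatenate.
import torch, torch.nn as nn

class UnifiedConditioner(nn.Module):
    def __init__(self, text_dim=512, img_dim=768, latent_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.img_proj = nn.Linear(img_dim, latent_dim)

    def forward(self, text_tokens, subject_img_emb):
        t = self.text_proj(text_tokens)                  # (B, T, latent)
        i = self.img_proj(subject_img_emb).unsqueeze(1)  # (B, 1, latent)
        return torch.cat([t, i], dim=1)                  # one joint sequence

cond = UnifiedConditioner()(torch.randn(2, 77, 512), torch.randn(2, 768))
print(cond.shape)  # torch.Size([2, 78, 256])
```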
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion [34.662798793560995]
We present a simple yet highly effective approach to personalization using a highly personalized (HiPer) text embedding.
Our method does not require model fine-tuning or identifiers, yet still enables manipulation of background, texture, and motion with just a single image and target text.
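The abstract suggests an optimization-only personalization in the spirit of textual inversion; the sketch below assumes a frozen generator and optimizes only a handful of extra embedding vectors appended to the prompt. Here frozen_model, recon_loss, and all sizes are placeholders, not the paper's API.

```python
# Conceptual sketch: learn a personalized embedding, keep the model frozen.
import torch
import torch.nn.functional as F

def personalize_embedding(frozen_model, recon_loss, image, prompt_embs,
                          steps=500, lr=5e-3, n_personal_tokens=4):
    # Only these extra embedding vectors are trained.
    personal = torch.randn(1, n_personal_tokens, prompt_embs.shape[-1],
                           requires_grad=True)
    opt = torch.optim.Adam([personal], lr=lr)
    for _ in range(steps):
        cond = torch.cat([prompt_embs, personal], dim=1)
        loss = recon_loss(frozen_model(cond), image)  # match the single image
        opt.zero_grad()
        loss.backward()
        opt.step()
    return personal.detach()  # reusable with new target texts

# Toy usage with stand-ins (a real setup would use a frozen diffusion
# model and its denoising loss).
dummy_model = lambda cond: cond.mean(dim=1)   # stand-in generator
target = torch.randn(1, 768)                  # stand-in "image"
emb = personalize_embedding(dummy_model, F.mse_loss, target,
                            torch.randn(1, 77, 768), steps=10)
print(emb.shape)  # torch.Size([1, 4, 768])
```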
arXiv Detail & Related papers (2023-03-15T17:07:45Z)
- UMFuse: Unified Multi View Fusion for Human Editing applications [36.94334399493266]
We design a multi-view fusion network that takes pose keypoints and texture from multiple source images.
We show the application of our network on two newly proposed tasks: multi-view human reposing and Mix&Match human image generation.
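A rough sketch of one simple form of multi-view fusion follows: per-view appearance features are blended with weights predicted from the target pose. The actual UMFuse architecture is more involved; shapes and layers here are assumptions.

```python
# Sketch: pose-conditioned weighting of per-view appearance features.
import torch, torch.nn as nn

class ViewFusion(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)   # scores (view feat, pose feat)

    def forward(self, view_feats, pose_feat):
        # view_feats: (B, V, dim) one feature per source view
        # pose_feat:  (B, dim)   target-pose conditioning
        p = pose_feat.unsqueeze(1).expand_as(view_feats)
        w = torch.softmax(self.score(torch.cat([view_feats, p], -1)), dim=1)
        return (w * view_feats).sum(dim=1)   # fused appearance feature

fused = ViewFusion()(torch.randn(2, 3, 128), torch.randn(2, 128))
print(fused.shape)  # torch.Size([2, 128])
```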
arXiv Detail & Related papers (2022-11-17T05:09:58Z)
- HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- Generating Person Images with Appearance-aware Pose Stylizer [66.44220388377596]
We present a novel end-to-end framework to generate realistic person images based on given person poses and appearances.
The core of our framework is a novel generator called Appearance-aware Pose Stylizer (APS) which generates human images by coupling the target pose with the conditioned person appearance progressively.
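A loose sketch of progressive pose/appearance coupling in an AdaIN-like style follows; this is an assumption about the mechanism for illustration only, and the actual APS details come from the paper.

```python
# Sketch: appearance predicts per-channel affines that restyle pose
# features, applied stage by stage.
import torch, torch.nn as nn

class PoseStylizerStage(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.to_scale = nn.Linear(dim, dim)
        self.to_shift = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pose_feat, appearance):
        s = self.to_scale(appearance).unsqueeze(1)
        b = self.to_shift(appearance).unsqueeze(1)
        return self.norm(pose_feat) * (1 + s) + b

stages = nn.ModuleList(PoseStylizerStage() for _ in range(3))
x, app = torch.randn(2, 16, 64), torch.randn(2, 64)
for stage in stages:          # progressive coupling across stages
    x = stage(x, app)
print(x.shape)  # torch.Size([2, 16, 64])
```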
arXiv Detail & Related papers (2020-07-17T15:58:05Z)