Faces à la Carte: Text-to-Face Generation via Attribute Disentanglement
- URL: http://arxiv.org/abs/2006.07606v2
- Date: Fri, 18 Sep 2020 07:21:45 GMT
- Title: Faces à la Carte: Text-to-Face Generation via Attribute Disentanglement
- Authors: Tianren Wang, Teng Zhang, Brian Lovell
- Abstract summary: Text-to-Face (TTF) synthesis is a challenging task with great potential for diverse computer vision applications.
We propose a Text-to-Face model that produces images in high resolution (1024x1024) with text-to-image consistency.
We refer to our model as TTF-HD. Experimental results show that TTF-HD generates high-quality faces with state-of-the-art performance.
- Score: 9.10088750358281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-Face (TTF) synthesis is a challenging task with great potential for
diverse computer vision applications. Compared to Text-to-Image (TTI) synthesis
tasks, the textual description of faces can be much more complicated and
detailed due to the variety of facial attributes and the parsing of high
dimensional abstract natural language. In this paper, we propose a Text-to-Face
model that not only produces images in high resolution (1024x1024) with
text-to-image consistency, but also outputs multiple diverse faces to cover a
wide range of unspecified facial features in a natural way. By fine-tuning the
multi-label classifier and image encoder, our model obtains the vectors and
image embeddings which are used to transform the input noise vector sampled
from the normal distribution. Afterwards, the transformed noise vector is fed
into a pre-trained high-resolution image generator to produce a set of faces
with the desired facial attributes. We refer to our model as TTF-HD.
Experimental results show that TTF-HD generates high-quality faces with
state-of-the-art performance.
Related papers
- Arc2Face: A Foundation Model for ID-Consistent Human Faces [95.00331107591859]
Arc2Face is an identity-conditioned face foundation model.
It can generate diverse photo-realistic images with a degree of face similarity unmatched by existing models.
arXiv Detail & Related papers (2024-03-18T10:32:51Z) - Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation
Using only Images [105.92311979305065]
TG-3DFace creates more realistic and aesthetically pleasing 3D faces, improving multi-view consistency (MVIC) by 9% over Latent3D.
The rendered face images generated by TG-3DFace achieve better FID and CLIP scores than text-to-2D face/image generation models.
arXiv Detail & Related papers (2023-08-31T14:26:33Z) - GaFET: Learning Geometry-aware Facial Expression Translation from
In-The-Wild Images [55.431697263581626]
We introduce a novel Geometry-aware Facial Expression Translation framework, which is based on parametric 3D facial representations and can stably decouple expression.
We achieve higher-quality and more accurate facial expression transfer results than state-of-the-art methods, and demonstrate applicability to various poses and complex textures.
arXiv Detail & Related papers (2023-08-07T09:03:35Z) - Unified Multi-Modal Latent Diffusion for Joint Subject and Text
Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method generates high-quality images with complex semantics drawn from both the input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z) - HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for
Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z) - StyleT2F: Generating Human Faces from Textual Description Using
StyleGAN2 [0.0]
StyleT2F is a method of controlling the output of StyleGAN2 using text.
Our method proves to capture the required features correctly and shows consistency between the input text and the output images.
arXiv Detail & Related papers (2022-04-17T04:51:30Z) - AnyFace: Free-style Text-to-Face Synthesis and Manipulation [41.61972206254537]
This paper proposes the first free-style text-to-face method, namely AnyFace.
AnyFace enables much wider open world applications such as metaverse, social media, cosmetics, forensics, etc.
arXiv Detail & Related papers (2022-03-29T08:27:38Z) - Learning Continuous Face Representation with Explicit Functions [20.5159277443333]
We propose an explicit model (EmFace) for human face representation in the form of a finite sum of mathematical terms.
EmFace achieves reasonable performance on several face image processing tasks, including face image restoration, denoising, and transformation.
arXiv Detail & Related papers (2021-10-25T03:49:20Z) - Multi-Attributed and Structured Text-to-Face Synthesis [1.3381749415517017]
Generative Adversarial Networks (GANs) have revolutionized image synthesis through many applications like face generation, photograph editing, and image super-resolution.
This paper empirically proves that increasing the number of facial attributes in each textual description helps GANs generate more diverse and real-looking faces.
arXiv Detail & Related papers (2021-08-25T07:52:21Z) - Semantic Text-to-Face GAN -ST^2FG [0.7919810878571298]
We present a novel approach to generate facial images from semantic text descriptions.
For security and criminal identification, the ability to provide a GAN-based system that works like a sketch artist would be incredibly useful.
arXiv Detail & Related papers (2021-07-22T15:42:25Z) - TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
Its StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning maps the image and text into a common embedding space to learn text-image matching.
Instance-level optimization is used for identity preservation during manipulation; a hedged sketch of this recipe follows after the list.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.