Factor Decomposed Generative Adversarial Networks for Text-to-Image
Synthesis
- URL: http://arxiv.org/abs/2303.13821v1
- Date: Fri, 24 Mar 2023 05:57:53 GMT
- Title: Factor Decomposed Generative Adversarial Networks for Text-to-Image
Synthesis
- Authors: Jiguo Li, Xiaobin Liu, Lirong Zheng
- Abstract summary: We propose Factor Decomposed Generative Adversa Networks(FDGAN)
We first generate images from the noise vector and then apply the sentence embedding in the normalization layer for both the generator and the discriminator.
The experimental results show that decomposing the noise and the sentence embedding can disentangle latent factors in text-to-image synthesis.
- Score: 7.658760090153791
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Prior work on text-to-image synthesis typically concatenates the sentence
embedding with the noise vector, although the two are different factors that
control different aspects of the generation. Simply concatenating them
entangles the latent factors and encumbers the generative model.
In this paper, we attempt to decompose these two factors and propose Factor
Decomposed Generative Adversarial Networks (FDGAN). To achieve this, we first
generate images from the noise vector and then apply the sentence embedding in
the normalization layers of both the generator and the discriminator. We also
design an additive norm layer to align and fuse the text-image features. The
experimental results show that decomposing the noise and the sentence embedding
disentangles the latent factors in text-to-image synthesis and makes the
generative model more efficient. Compared with the baseline, FDGAN achieves
better performance with fewer parameters.
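The abstract describes the mechanism only at a high level, so the snippet below is a minimal PyTorch-style sketch of the general idea: image features are produced from the noise vector alone, and the sentence embedding enters only through a normalization layer that adds a text-conditioned shift. The module name `SentenceConditionedNorm`, the choice of instance normalization, and the purely additive projection are illustrative assumptions, not the paper's exact additive norm layer.

```python
import torch
import torch.nn as nn


class SentenceConditionedNorm(nn.Module):
    """Sketch (not the authors' code): normalize image features, then fuse the
    sentence embedding through an additive, text-conditioned shift."""

    def __init__(self, num_channels: int, sent_dim: int):
        super().__init__()
        # Parameter-free normalization of the image features.
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        # Project the sentence embedding to one additive shift per channel.
        self.to_shift = nn.Linear(sent_dim, num_channels)

    def forward(self, feat: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) features generated from the noise vector
        # sent_emb: (B, sent_dim) sentence embedding
        shift = self.to_shift(sent_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return self.norm(feat) + shift


if __name__ == "__main__":
    layer = SentenceConditionedNorm(num_channels=64, sent_dim=256)
    feat = torch.randn(2, 64, 16, 16)  # image features from the noise pathway
    sent = torch.randn(2, 256)         # sentence embedding
    print(layer(feat, sent).shape)     # torch.Size([2, 64, 16, 16])
```

Under this reading, keeping the text out of the generator's input and injecting it only through normalization is what decomposes the two factors, so the noise vector and the sentence embedding can control different aspects of the output.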
Related papers
- Generating Intermediate Representations for Compositional Text-To-Image Generation [16.757550214291015]
We propose a compositional approach for text-to-image generation based on two stages.
In the first stage, we design a diffusion-based generative model to produce one or more aligned intermediate representations conditioned on text.
In the second stage, we map these representations, together with the text, to the final output image using a separate diffusion-based generative model.
arXiv Detail & Related papers (2024-10-13T10:24:55Z) - Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion
Models [68.47333676663312]
We show a simple modification of classifier-free guidance can help disentangle image factors in text-to-image models.
The key idea of our method, Contrastive Guidance, is to characterize an intended factor with two prompts that differ in minimal tokens.
We illustrate its benefits in three scenarios: (1) to guide domain-specific diffusion models trained on an object class, (2) to gain continuous, rig-like controls for text-to-image generation, and (3) to improve the performance of zero-shot image editors.
arXiv Detail & Related papers (2024-02-21T03:01:17Z) - SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers [50.90457644954857]
In this work, we apply diffusion models to approach sequence-to-sequence text generation.
We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation.
Experimental results illustrate good performance on sequence-to-sequence generation in terms of text quality and inference time.
arXiv Detail & Related papers (2022-12-20T15:16:24Z) - Contrastive Learning for Diverse Disentangled Foreground Generation [67.81298739373766]
We introduce a new method for diverse foreground generation with explicit control over various factors.
We leverage contrastive learning with latent codes to generate diverse foreground results for the same masked input.
Experiments demonstrate the superiority of our method over state-of-the-art methods in result diversity and generation controllability.
arXiv Detail & Related papers (2022-11-04T18:51:04Z) - eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z) - Describing Sets of Images with Textual-PCA [89.46499914148993]
We seek to semantically describe a set of images, capturing both the attributes of single images and the variations within the set.
Our procedure is analogous to Principal Component Analysis, in which the role of projection vectors is replaced with generated phrases.
arXiv Detail & Related papers (2022-10-21T17:10:49Z) - Text to Image Synthesis using Stacked Conditional Variational
Autoencoders and Conditional Generative Adversarial Networks [0.0]
Current text-to-image synthesis approaches fall short of producing a high-resolution image that represents a text descriptor.
This study uses Conditional VAEs as an initial generator to produce a high-level sketch of the text descriptor.
The proposed architecture benefits from a conditioning augmentation and a residual block on the Conditional GAN network to achieve the results.
arXiv Detail & Related papers (2022-07-06T13:43:56Z) - OptGAN: Optimizing and Interpreting the Latent Space of the Conditional
Text-to-Image GANs [8.26410341981427]
We study how to ensure that generated samples are believable, realistic, or natural.
We present a novel algorithm which identifies semantically-understandable directions in the latent space of a conditional text-to-image GAN architecture.
arXiv Detail & Related papers (2022-02-25T20:00:33Z) - Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z) - Leveraging Conditional Generative Models in a General Explanation
Framework of Classifier Decisions [0.0]
We show that a visual explanation can be produced as the difference between two generated images.
We present two different approximations and implementations of the general formulation.
arXiv Detail & Related papers (2021-06-21T09:41:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.