Text to Image Synthesis using Stacked Conditional Variational
Autoencoders and Conditional Generative Adversarial Networks
- URL: http://arxiv.org/abs/2207.03332v1
- Date: Wed, 6 Jul 2022 13:43:56 GMT
- Title: Text to Image Synthesis using Stacked Conditional Variational
Autoencoders and Conditional Generative Adversarial Networks
- Authors: Haileleol Tibebu, Aadin Malik, Varuna De Silva
- Abstract summary: Current text-to-image synthesis approaches fall short of producing a high-resolution image that represents a text descriptor.
This study uses Conditional VAEs as an initial generator to produce a high-level sketch of the text descriptor.
The proposed architecture benefits from conditioning augmentation and a residual block in the Conditional GAN network to achieve these results.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthesizing a realistic image from a textual description is a major
challenge in computer vision. Current text-to-image synthesis approaches fall
short of producing a high-resolution image that represents a text descriptor.
Most existing studies rely either on Generative Adversarial Networks (GANs) or
Variational Auto-Encoders (VAEs). GANs can produce sharper images but lack
diversity in their outputs, whereas VAEs are good at producing a diverse range
of outputs, but the images they generate are often blurred. Taking into account
the relative advantages of both GANs and VAEs, we propose a new stacked
Conditional VAE (CVAE) and Conditional GAN (CGAN) network architecture for
synthesizing images conditioned on a text description. This study uses a
Conditional VAE as an initial generator to produce a high-level sketch of the
text descriptor. This high-level sketch from the first stage, together with the
text descriptor, is used as input to the Conditional GAN network. The
second-stage GAN produces a 256x256 high-resolution image. The proposed
architecture benefits from conditioning augmentation and a residual block in
the Conditional GAN network to achieve these results. Multiple experiments were
conducted on the CUB and Oxford-102 datasets, and the results of the proposed
approach are compared against state-of-the-art techniques such as StackGAN. The
experiments illustrate that the proposed method generates a high-resolution
image conditioned on text descriptions and yields competitive results on both
datasets in terms of the Inception Score and Fréchet Inception Distance.
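To make the two-stage pipeline concrete, the sketch below lays out the stacked
CVAE-then-CGAN flow described in the abstract: conditioning augmentation turns a
text embedding into a sampled conditioning vector, a Conditional VAE produces a
low-resolution sketch, and a Conditional GAN generator with residual blocks
upsamples that sketch to 256x256. This is a minimal PyTorch sketch under assumed
layer sizes and module names; it is not the authors' released code, and the text
encoder is omitted (a precomputed text embedding is assumed).

```python
# Minimal PyTorch sketch of the stacked CVAE -> CGAN pipeline described above.
# All module names, layer sizes, and the precomputed text embedding are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditioningAugmentation(nn.Module):
    """Maps a text embedding to a Gaussian and samples a conditioning vector
    (the conditioning-augmentation idea popularized by StackGAN)."""
    def __init__(self, text_dim=1024, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(text_dim, cond_dim * 2)

    def forward(self, text_emb):
        mu, logvar = self.fc(text_emb).chunk(2, dim=1)
        c = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return c, mu, logvar

class StageOneCVAE(nn.Module):
    """Conditional VAE: encodes a 64x64 image together with the text condition
    and decodes a low-resolution sketch of the described scene."""
    def __init__(self, cond_dim=128, z_dim=100):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(),
                                 nn.Linear(3 * 64 * 64 + cond_dim, 512), nn.ReLU())
        self.to_mu = nn.Linear(512, z_dim)
        self.to_logvar = nn.Linear(512, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + cond_dim, 3 * 64 * 64), nn.Tanh())

    def forward(self, img64, c):
        h = self.enc(torch.cat([img64.flatten(1), c], dim=1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        sketch = self.dec(torch.cat([z, c], dim=1)).view(-1, 3, 64, 64)
        return sketch, mu, logvar

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return F.relu(x + self.body(x))

class StageTwoGenerator(nn.Module):
    """Conditional GAN generator: consumes the stage-one sketch plus the
    conditioning vector and upsamples 64x64 -> 256x256."""
    def __init__(self, cond_dim=128, ch=64):
        super().__init__()
        self.stem = nn.Conv2d(3 + cond_dim, ch, 3, padding=1)
        self.res = nn.Sequential(ResidualBlock(ch), ResidualBlock(ch))
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh())

    def forward(self, sketch64, c):
        c_map = c[:, :, None, None].expand(-1, -1, 64, 64)  # tile condition spatially
        h = F.relu(self.stem(torch.cat([sketch64, c_map], dim=1)))
        return self.up(self.res(h))

# Usage sketch: in training, the KL terms from both mu/logvar pairs would be
# regularized and a discriminator would supervise the 256x256 output.
text_emb = torch.randn(4, 1024)          # placeholder for an encoded caption
real64 = torch.randn(4, 3, 64, 64)       # placeholder 64x64 training images
c, mu_c, logvar_c = ConditioningAugmentation()(text_emb)
sketch, mu, logvar = StageOneCVAE()(real64, c)
img256 = StageTwoGenerator()(sketch, c)  # -> (4, 3, 256, 256)
```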
Related papers
- Improving Scene Text Image Super-resolution via Dual Prior Modulation
Network [20.687100711699788]
Scene text image super-resolution (STISR) aims to simultaneously increase the resolution and legibility of the text images.
Existing approaches neglect the global structure of the text, which bounds the semantic determinism of the scene text.
Our work proposes a plug-and-play module dubbed Dual Prior Modulation Network (DPMN), which leverages dual image-level priors to bring performance gain over existing approaches.
arXiv Detail & Related papers (2023-02-21T02:59:37Z) - Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto standard of Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - OptGAN: Optimizing and Interpreting the Latent Space of the Conditional
Text-to-Image GANs [8.26410341981427]
We study how to ensure that generated samples are believable, realistic or natural.
We present a novel algorithm which identifies semantically-understandable directions in the latent space of a conditional text-to-image GAN architecture.
arXiv Detail & Related papers (2022-02-25T20:00:33Z) - Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes (a minimal sketch of this inversion-and-optimization loop appears after this list).
arXiv Detail & Related papers (2021-08-03T08:38:16Z) - Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z) - Aggregated Contextual Transformations for High-Resolution Image
Inpainting [57.241749273816374]
We propose an enhanced GAN-based model, named Aggregated COntextual-Transformation GAN (AOT-GAN) for high-resolution image inpainting.
To enhance context reasoning, we construct the generator of AOT-GAN by stacking multiple layers of a proposed AOT block.
For improving texture synthesis, we enhance the discriminator of AOT-GAN by training it with a tailored mask-prediction task.
arXiv Detail & Related papers (2021-04-03T15:50:17Z) - DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z) - Image-to-Image Translation with Text Guidance [139.41321867508722]
The goal of this paper is to embed controllable factors, i.e., natural language descriptions, into image-to-image translation with generative adversarial networks.
We propose several key components: (1) the implementation of part-of-speech tagging to filter out non-semantic words in the given description, (2) the adoption of an affine combination module to effectively fuse text and image features from different modalities, and (3) a novel refined multi-stage architecture to strengthen the differential ability of discriminators and the rectification ability of generators.
arXiv Detail & Related papers (2020-02-12T21:09:15Z)
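As a concrete illustration of the GAN-inversion-plus-latent-optimization recipe
summarized for Cycle-Consistent Inverse GAN above, the sketch below inverts an
image to a latent code and then refines that code by gradient descent against a
text-matching objective. The encoder, generator, and matching loss are stand-in
assumptions for illustration only, not the paper's actual networks or training
objective.

```python
# Hedged sketch of GAN inversion followed by text-guided latent optimization.
# The tiny generator/encoder and the matching loss are placeholders; a real
# system would use trained networks and a learned text-image similarity model.
import torch
import torch.nn as nn

z_dim = 128
generator = nn.Sequential(nn.Linear(z_dim, 3 * 64 * 64), nn.Tanh())   # placeholder G
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, z_dim))  # inversion model E

def matching_loss(image_flat, text_emb):
    # Placeholder for a learned text-image matching score.
    return ((image_flat.mean(dim=1) - text_emb.mean(dim=1)) ** 2).mean()

image = torch.randn(1, 3, 64, 64)   # image to manipulate
text_emb = torch.randn(1, 256)      # embedding of the target description

# 1) Invert the image to a latent code.
z = encoder(image).detach().clone().requires_grad_(True)

# 2) Optimize the inverted code so that the generated image matches the text.
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(100):
    opt.zero_grad()
    loss = matching_loss(generator(z), text_emb)
    loss.backward()
    opt.step()

edited = generator(z).view(1, 3, 64, 64)  # image with the requested attributes
```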
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.