OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs
- URL: http://arxiv.org/abs/2202.12929v1
- Date: Fri, 25 Feb 2022 20:00:33 GMT
- Title: OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs
- Authors: Zhenxing Zhang and Lambert Schomaker
- Abstract summary: We study how to ensure that generated samples are believable, realistic or natural.
We present a novel algorithm which identifies semantically-understandable directions in the latent space of a conditional text-to-image GAN architecture.
- Score: 8.26410341981427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image generation aims to automatically produce a photo-realistic image conditioned on a textual description. It can potentially be employed in art creation, data augmentation, photo editing, and related applications. Although many
efforts have been dedicated to this task, it remains particularly challenging
to generate believable, natural scenes. To facilitate the real-world
applications of text-to-image synthesis, we focus on studying the following
three issues: 1) How to ensure that generated samples are believable, realistic
or natural? 2) How to exploit the latent space of the generator to edit a
synthesized image? 3) How to improve the explainability of a text-to-image
generation framework? In this work, we constructed two novel data sets (i.e.,
the Good & Bad bird and face data sets) consisting of successful as well as
unsuccessful generated samples, according to strict criteria. To effectively
and efficiently acquire high-quality images by increasing the probability of
generating Good latent codes, we use a dedicated Good/Bad classifier for
generated images, built on a pre-trained front end and fine-tuned on the proposed Good & Bad data set. We then present a novel
algorithm which identifies semantically-understandable directions in the latent
space of a conditional text-to-image GAN architecture by performing independent
component analysis on the pre-trained weight values of the generator.
Furthermore, we develop a background-flattening loss (BFL), to improve the
background appearance in the edited image. Subsequently, we introduce linear
interpolation analysis between pairs of keywords, and extend it to a similar triangular 'linguistic' interpolation to probe what a text-to-image synthesis model has learned within the linguistic embeddings. Our data set is available at
https://zenodo.org/record/6283798#.YhkN_ujMI2w.
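
The Good/Bad filtering step described above can be pictured with a short sketch: a pre-trained image backbone is fine-tuned as a binary Good/Bad classifier on the Good & Bad data set, and generated images are kept only when the classifier rates them as Good. This is a minimal sketch assuming a torchvision ResNet-18 front end and a hypothetical generator(z, text_embedding) interface; the paper's actual front end, training procedure, and sampling strategy may differ.

```python
import torch
import torch.nn as nn
from torchvision import models

class GoodBadClassifier(nn.Module):
    """Binary Good/Bad classifier: pre-trained front end plus a new 2-way head."""
    def __init__(self):
        super().__init__()
        self.backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 2)  # Good vs. Bad

    def forward(self, images):
        return self.backbone(images)

@torch.no_grad()
def sample_good_images(generator, text_embedding, classifier,
                       n_needed=64, latent_dim=100, threshold=0.5):
    """Resample latent codes, keeping only images the classifier rates as Good."""
    kept = []
    while len(kept) < n_needed:
        z = torch.randn(16, latent_dim)           # candidate latent codes (batch of 16)
        images = generator(z, text_embedding)     # hypothetical conditional generator call
        p_good = torch.softmax(classifier(images), dim=1)[:, 1]
        kept.extend(img for img, p in zip(images, p_good) if p > threshold)
    return torch.stack(kept[:n_needed])
```

In practice the classifier would first be fine-tuned with a standard cross-entropy objective on the Good & Bad images before being used as a filter.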
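The direction-finding step, independent component analysis on pre-trained generator weights, can be sketched with scikit-learn's FastICA. Which layer's weight matrix is decomposed and how many components are kept are illustrative assumptions; the sketch treats the rows of a weight matrix that reads the latent code as observations.

```python
import numpy as np
from sklearn.decomposition import FastICA

def find_latent_directions(weight_matrix: np.ndarray, n_directions: int = 10) -> np.ndarray:
    """weight_matrix: (out_features, latent_dim) weights of a generator layer
    that consumes the latent code. Returns unit-norm candidate edit directions."""
    ica = FastICA(n_components=n_directions, whiten="unit-variance", random_state=0)
    ica.fit(weight_matrix)                        # components_: (n_directions, latent_dim)
    directions = ica.components_
    return directions / np.linalg.norm(directions, axis=1, keepdims=True)

# Editing a synthesized image then amounts to shifting its latent code along
# one of the recovered directions:  z_edited = z + alpha * directions[k]
```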
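The background-flattening loss (BFL) is defined in the paper itself; the snippet below is only a hypothetical reading of the idea, namely that, given a foreground mask, changes in the background region are penalized so that latent-space edits leave the background visually flat.

```python
import torch

def background_flattening_loss(edited, original, foreground_mask):
    """Hypothetical BFL sketch, not the paper's exact formulation.
    edited, original: (B, C, H, W) images; foreground_mask: (B, 1, H, W) in [0, 1].
    Penalizes pixel changes outside the foreground object."""
    background = 1.0 - foreground_mask
    return torch.mean(torch.abs(edited - original) * background)
```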
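The linear and triangular 'linguistic' interpolation analyses can be sketched as blending keyword embeddings before feeding them to the conditional generator. The barycentric weighting for the triangular case and the embedding interfaces are assumptions made for illustration.

```python
import numpy as np

def linear_interpolation(emb_a, emb_b, steps=8):
    """Embeddings along the segment between two keyword embeddings."""
    return [(1 - t) * emb_a + t * emb_b for t in np.linspace(0.0, 1.0, steps)]

def triangular_interpolation(emb_a, emb_b, emb_c, steps=5):
    """Embeddings on a triangular grid spanned by three keyword embeddings,
    using barycentric weights w_a + w_b + w_c = 1."""
    blends = []
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            w_a, w_b = i / steps, j / steps
            w_c = 1.0 - w_a - w_b
            blends.append(w_a * emb_a + w_b * emb_b + w_c * emb_c)
    return blends

# Each blended embedding replaces a single keyword embedding at generation
# time, so the grid of outputs shows how the model moves between concepts.
```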
Related papers
- Visual Text Generation in the Wild [67.37458807253064] (2024-07-19)
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild.
The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability.
The generated images provide superior utility for tasks involving text detection and text recognition.
- Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation [29.274362919954218] (2023-09-12)
We propose a new paradigm to automatically generate training data with accurate labels at scale.
The proposed approach decouples training data generation into foreground object generation and contextually coherent background generation.
We demonstrate the advantages of our approach on five object detection and segmentation datasets.
- Style Generation: Image Synthesis based on Coarsely Matched Texts [10.939482612568433] (2023-09-08)
We introduce a novel task called text-based style generation and propose a two-stage generative adversarial network.
The first stage generates the overall image style with a sentence feature, and the second stage refines the generated style with a synthetic feature.
The practical potential of our work is demonstrated by various applications such as text-image alignment and story visualization.
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721] (2023-08-09)
We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
- Optimized latent-code selection for explainable conditional text-to-image GANs [8.26410341981427] (2022-04-27)
We present a variety of techniques to take a deep look into the latent space and semantic space of the conditional text-to-image GANs model.
We propose a framework for finding good latent codes by utilizing a linear SVM.
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722] (2021-08-03)
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
- Learned Spatial Representations for Few-shot Talking-Head Synthesis [68.3787368024951] (2021-04-29)
We propose a novel approach for few-shot talking-head synthesis.
We show that this disentangled representation leads to a significant improvement over previous methods.
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705] (2021-04-01)
A text-to-image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions.
We propose a novel framework, Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285] (2020-08-13)
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler but more efficient at synthesizing realistic and text-matching images.
- Intrinsic Autoencoders for Joint Neural Rendering and Intrinsic Image Decomposition [67.9464567157846] (2020-06-29)
We propose an autoencoder for joint generation of realistic images from synthetic 3D models while simultaneously decomposing real images into their intrinsic shape and appearance properties.
Our experiments confirm that a joint treatment of rendering and decomposition is indeed beneficial and that our approach outperforms state-of-the-art image-to-image translation baselines both qualitatively and quantitatively.
This list is automatically generated from the titles and abstracts of the papers on this site.