Efficient Neural Architecture for Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2004.11437v1
- Date: Thu, 23 Apr 2020 19:33:40 GMT
- Title: Efficient Neural Architecture for Text-to-Image Synthesis
- Authors: Douglas M. Souza, Jônatas Wehrmann, Duncan D. Ruiz
- Abstract summary: We show that an effective neural architecture can achieve state-of-the-art performance using single-stage training with a single generator and a single discriminator.
Our work points to a new direction for text-to-image research, which has not experimented with novel neural architectures recently.
- Score: 6.166295570030645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image synthesis is the task of generating images from text
descriptions. Image generation, by itself, is a challenging task. When we
combine image generation and text, we bring complexity to a new level: we need
to combine data from two different modalities. Most recent works in
text-to-image synthesis follow a similar approach when it comes to neural
architectures. Due to the aforementioned difficulties, plus the inherent
difficulty of training GANs at high resolutions, most methods have adopted a
multi-stage training strategy. In this paper we shift the architectural
paradigm currently used in text-to-image methods and show that an effective
neural architecture can achieve state-of-the-art performance using
single-stage training with a single generator and a single discriminator. We
do so by applying deep residual networks along with a novel sentence
interpolation strategy that enables learning a smooth conditional space.
Finally, our work points to a new direction for text-to-image research, which
has not experimented with novel neural architectures recently.
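To make the sentence interpolation idea concrete, the sketch below shows one plausible reading in PyTorch (the function name and exact mixing scheme are assumptions, not the authors' released code): the conditioning vector fed to the generator is a random convex combination of two caption embeddings of the same image, so training covers the space between discrete caption points rather than the points alone.
```python
# Minimal sketch of sentence-embedding interpolation for conditioning a
# GAN. Hypothetical: the function name and mixing scheme are assumptions,
# not the authors' released code.
import torch

def interpolate_sentences(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """Mix the embeddings of two captions of the same image.

    emb_a, emb_b: (batch, dim) sentence embeddings from a text encoder.
    Returns a (batch, dim) conditioning vector lying between the two, so
    the generator is trained on a smooth, continuous conditional space.
    """
    t = torch.rand(emb_a.size(0), 1, device=emb_a.device)  # per-sample weight in [0, 1)
    return t * emb_a + (1.0 - t) * emb_b

# Usage inside a training step, with emb_a and emb_b encoding two
# captions of the same image:
# cond = interpolate_sentences(emb_a, emb_b)
# fake = generator(torch.randn(batch_size, z_dim), cond)
```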
Related papers
- Advanced Multimodal Deep Learning Architecture for Image-Text Matching [33.8315200009152]
Image-text matching is a key multimodal task that aims to model the semantic association between images and text as a matching relationship.
We introduce an advanced multimodal deep learning architecture, which combines the high-level abstract representation ability of deep neural networks for visual information with the advantages of natural language processing models for text semantic understanding.
Experiments show that, compared with existing image-text matching models, the optimized model achieves significantly improved performance on a series of benchmark datasets.
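As a rough illustration of the matching setup (a generic sketch, not this paper's architecture; dimensions and projection heads are placeholders), both modalities can be projected into a shared space and scored by cosine similarity, with a symmetric contrastive loss pulling matched pairs together:
```python
# Generic image-text matching sketch: project both modalities into a
# shared space and score pairs by cosine similarity. Dimensions and
# projection heads are placeholders, not the paper's actual model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingHead(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, joint_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)  # visual branch
        self.txt_proj = nn.Linear(txt_dim, joint_dim)  # text branch

    def forward(self, img_feat, txt_feat):
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v @ t.T  # (batch, batch) cosine similarity matrix

def matching_loss(sim):
    # The diagonal of the similarity matrix holds the matched pairs.
    labels = torch.arange(sim.size(0), device=sim.device)
    return (F.cross_entropy(sim, labels) + F.cross_entropy(sim.T, labels)) / 2
```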
arXiv Detail & Related papers (2024-06-13T08:32:24Z)
- Style Generation: Image Synthesis based on Coarsely Matched Texts [10.939482612568433]
We introduce a novel task called text-based style generation and propose a two-stage generative adversarial network.
The first stage generates the overall image style with a sentence feature, and the second stage refines the generated style with a synthetic feature.
The practical potential of our work is demonstrated by various applications such as text-image alignment and story visualization.
arXiv Detail & Related papers (2023-09-08T21:51:11Z)
- Self-Supervised Image-to-Text and Text-to-Image Synthesis [23.587581181330123]
We propose a novel self-supervised deep learning based approach towards learning the cross-modal embedding spaces.
In our approach, we first obtain dense vector representations of images using a StackGAN-based autoencoder model, and dense sentence-level vector representations using an LSTM-based text autoencoder.
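A minimal sketch of the sentence-level LSTM autoencoder half of this setup might look as follows (vocabulary size, dimensions, and teacher-forced decoding are assumptions), with the encoder's final hidden state serving as the dense sentence vector:
```python
# Minimal LSTM text autoencoder sketch: the encoder's last hidden state
# is the dense sentence-level representation; the decoder reconstructs
# the token sequence from it. Dimensions and names are assumptions.
import torch
import torch.nn as nn

class TextAutoencoder(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        emb = self.embed(tokens)
        _, (h, c) = self.encoder(emb)          # h: (1, batch, hid_dim)
        sentence_vec = h[-1]                   # dense sentence embedding
        # Teacher-forced reconstruction from the encoded state.
        dec_out, _ = self.decoder(emb, (h, c))
        logits = self.out(dec_out)             # (batch, seq_len, vocab)
        return logits, sentence_vec
```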
arXiv Detail & Related papers (2021-12-09T13:54:56Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis [68.3787368024951]
We propose a novel approach for multi-modal image-to-image (I2I) translation.
We learn a latent embedding, jointly with the generator, that models the variability of the output domain.
Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.
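A heavily simplified sketch of such a style encoder is shown below; the architecture is a placeholder and, since the paper's proxy task is not described here, the objective shown (pulling together the style codes of two augmented views) is purely an assumption:
```python
# Hypothetical style-encoder sketch: map an image to a low-dimensional
# style code. The architecture and the proxy objective shown here are
# illustrative assumptions, not the paper's actual pre-training task.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleEncoder(nn.Module):
    def __init__(self, style_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global pooling
        )
        self.fc = nn.Linear(128, style_dim)     # low-dimensional style code

    def forward(self, x):                       # x: (batch, 3, H, W)
        return self.fc(self.net(x).flatten(1))

# One plausible proxy objective (an assumption, not the paper's task):
# two augmented views of the same image should have nearby style codes.
def proxy_loss(enc, view_a, view_b):
    return F.mse_loss(enc(view_a), enc(view_b))
```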
arXiv Detail & Related papers (2021-04-14T19:58:24Z)
- RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network [19.017377597937617]
We study the compositional learning of images and texts for image retrieval.
We introduce a novel method that combines the graph convolutional network (GCN) with existing composition methods.
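The residual-composition idea (without the GCN component) can be sketched as follows, with dimensions and the fusion MLP as assumptions: the composed retrieval query is the reference image feature plus a residual predicted jointly from image and text.
```python
# Sketch of residual text-image composition for retrieval (the GCN
# component is omitted): the composed query is the image feature plus a
# residual predicted from the concatenated image and text features.
# Dimensions and the fusion MLP are assumptions.
import torch
import torch.nn as nn

class ResidualComposer(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, img_dim), nn.ReLU(),
            nn.Linear(img_dim, img_dim),
        )

    def forward(self, img_feat, txt_feat):
        residual = self.fuse(torch.cat([img_feat, txt_feat], dim=-1))
        return img_feat + residual  # composed query feature for retrieval
```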
arXiv Detail & Related papers (2021-04-07T09:41:52Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
Its StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning maps images and text into a common embedding space to learn text-image matching.
Instance-level optimization preserves identity during manipulation.
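As a generic illustration of the inversion step (optimization-based inversion is shown; the framework itself may rely on a trained encoder), a latent code can be fitted so the frozen generator reconstructs the target image:
```python
# Generic optimization-based GAN inversion sketch: fit a latent code so
# that the (frozen) generator reconstructs the target image. A stand-in
# illustration, not TediGAN's actual inversion module.
import torch
import torch.nn.functional as F

def invert(generator, target, latent_dim=512, steps=500, lr=0.05):
    for p in generator.parameters():
        p.requires_grad_(False)                 # freeze G; optimize only w
    w = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = generator(w)                    # assumed G: latent -> image
        loss = F.mse_loss(recon, target)        # pixel reconstruction only
        loss.backward()
        opt.step()
    return w.detach()
```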
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
- NAS-DIP: Learning Deep Image Prior with Neural Architecture Search [65.79109790446257]
Recent work has shown that the structure of deep convolutional neural networks can be used as a structured image prior.
We propose to search for neural architectures that capture stronger image priors.
We search for an improved network by leveraging an existing neural architecture search algorithm.
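For context, the deep image prior being searched over works roughly as sketched below (a standard, minimal illustration; the tiny CNN is a placeholder, not a searched architecture): a randomly initialized network is fitted to map fixed noise to the corrupted image, and early stopping recovers structure before noise.
```python
# Deep image prior in a nutshell: fit a randomly initialized CNN to map
# fixed noise to a corrupted image; stopped early, the network
# reproduces image structure before it reproduces the noise.
import torch
import torch.nn as nn

def dip_restore(noisy, steps=1800, lr=0.01):
    net = nn.Sequential(                       # placeholder architecture
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1),
    )
    z = torch.randn(1, 32, *noisy.shape[-2:])  # fixed random input
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):                     # early stopping acts as the prior
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(z), noisy)
        loss.backward()
        opt.step()
    return net(z).detach()
```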
arXiv Detail & Related papers (2020-08-26T17:59:36Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more effective at synthesizing realistic, text-matching images.
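A matching-aware gradient penalty can be sketched as below (a hedged reading: the weight k, exponent p, and discriminator signature are illustrative assumptions): the penalty is computed only at real images paired with their matching sentence embeddings, smoothing the discriminator around the true data points.
```python
# Sketch of a matching-aware gradient penalty: penalize the
# discriminator's gradient norm at real images paired with matching
# sentence embeddings. Weight k and exponent p are illustrative defaults.
import torch

def matching_aware_gp(disc, real_img, sent_emb, k=2.0, p=6.0):
    real_img = real_img.requires_grad_(True)
    sent_emb = sent_emb.requires_grad_(True)
    out = disc(real_img, sent_emb).sum()       # assumed D: (image, text) -> score
    grad_img, grad_txt = torch.autograd.grad(
        out, (real_img, sent_emb), create_graph=True)
    norm = (grad_img.flatten(1).pow(2).sum(1)
            + grad_txt.flatten(1).pow(2).sum(1)).sqrt()
    return k * norm.pow(p).mean()              # add to the discriminator loss
```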
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
- Scene Text Synthesis for Efficient and Effective Deep Network Training [62.631176120557136]
We develop an innovative image synthesis technique that composes annotated training images by embedding foreground objects of interest into background images.
The proposed technique consists of two key components that in principle boost the usefulness of the synthesized images in deep network training.
Experiments over a number of public datasets demonstrate the effectiveness of our proposed image synthesis technique.
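A minimal version of the compose-and-annotate step might look like this (NumPy sketch; blending, scaling, and placement semantics are simplified assumptions): the foreground is alpha-pasted at a sampled location and its bounding box becomes a free annotation.
```python
# Minimal compose-and-annotate sketch: alpha-paste a foreground object
# into a background at a random location and record its bounding box.
# A real pipeline would also handle scaling, blending, and semantics.
import numpy as np

def compose(background, foreground, mask, rng=None):
    """background: (H, W, 3); foreground: (h, w, 3); mask: (h, w) in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    H, W, _ = background.shape
    h, w, _ = foreground.shape
    y = rng.integers(0, H - h)                 # sampled paste location
    x = rng.integers(0, W - w)
    out = background.copy()
    region = out[y:y + h, x:x + w]
    alpha = mask[..., None]
    out[y:y + h, x:x + w] = alpha * foreground + (1 - alpha) * region
    bbox = (x, y, x + w, y + h)                # annotation comes for free
    return out, bbox
```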
arXiv Detail & Related papers (2019-01-26T10:15:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.