PerceptionGAN: Real-world Image Construction from Provided Text through
Perceptual Understanding
- URL: http://arxiv.org/abs/2007.00977v1
- Date: Thu, 2 Jul 2020 09:23:08 GMT
- Title: PerceptionGAN: Real-world Image Construction from Provided Text through
Perceptual Understanding
- Authors: Kanish Garg, Ajeet Kumar Singh, Dorien Herremans, Brejesh Lall
- Abstract summary: We propose a method to provide good initial images by incorporating perceptual understanding in the discriminator module.
We show that the perceptual information included in the initial image is improved while modeling image distribution at multiple stages.
- More importantly, the proposed method can be integrated into the pipeline of other state-of-the-art text-based image generation models.
- Score: 11.985768957782641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating an image from a provided descriptive text is quite a challenging
task because of the difficulty in incorporating perceptual information (object
shapes, colors, and their interactions) along with providing high relevancy
related to the provided text. Current methods first generate an initial
low-resolution image, which typically has irregular object shapes, colors, and
interaction between objects. This initial image is then improved by
conditioning on the text. However, these methods mainly address the problem of
using text representation efficiently in the refinement of the initially
generated image, while the success of this refinement process depends heavily
on the quality of the initially generated image, as pointed out in the DM-GAN
paper. Hence, we propose a method to provide good initial images by
incorporating perceptual understanding in the discriminator module. We improve
the perceptual information at the first stage itself, which results in a
significant improvement in the final generated image. In this paper, we apply
our approach to the StackGAN architecture and show that the perceptual
information captured in the initial image is improved while modeling the image
distribution at multiple stages. Finally, we generate realistic multi-colored
images conditioned on text; these images are of good quality and contain
improved basic perceptual information. More importantly, the proposed method
can be integrated into the pipeline of other state-of-the-art text-based image
generation models to generate the initial low-resolution images.
We also improve the refinement process in StackGAN by augmenting the third
generator-discriminator stage of the architecture. Our experimental analysis
and comparison with the state-of-the-art on the large but sparse MS COCO
dataset further validate the usefulness of our proposed approach.
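The paper does not include code here, but the core idea of injecting perceptual understanding into the first-stage discriminator can be illustrated with a minimal, hypothetical PyTorch sketch. Everything below (module names, layer sizes, and the use of a discriminator feature-matching term as the perceptual signal) is an illustrative assumption, not the authors' implementation: a Stage-I conditional discriminator exposes its intermediate features, and the generator is penalized when the features of its initial image diverge from those of the real image.

```python
# Hypothetical sketch (not the authors' code): a Stage-I discriminator whose
# intermediate features provide a perceptual signal for the initial image.
# Layer sizes, names, and the feature-matching loss are illustrative assumptions.
import torch
import torch.nn as nn

class StageIDiscriminator(nn.Module):
    def __init__(self, text_dim=128, ndf=64):
        super().__init__()
        # Downsampling trunk for 64x64 initial images.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, ndf, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),      # 32x32
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1), nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),                                   # 16x16
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1), nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),                                   # 8x8
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1), nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2, inplace=True),                                   # 4x4
        )
        # Joint image-text logit, in the style of conditional StackGAN discriminators.
        self.joint = nn.Sequential(
            nn.Conv2d(ndf * 8 + text_dim, ndf * 8, 3, 1, 1),
            nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 8, 1, 4, 1, 0),
        )

    def forward(self, image, text_embedding):
        feat = self.trunk(image)                      # perceptual features, (B, ndf*8, 4, 4)
        text = text_embedding[:, :, None, None].expand(-1, -1, 4, 4)
        logit = self.joint(torch.cat([feat, text], dim=1)).view(-1)
        return logit, feat

def perceptual_feature_loss(disc, real_images, fake_images, text_embedding):
    """Feature-matching term: push discriminator features of the generated
    initial image toward those of the real image, one possible way to inject
    'perceptual understanding' (object shapes, colors) into the first stage."""
    with torch.no_grad():
        _, real_feat = disc(real_images, text_embedding)
    _, fake_feat = disc(fake_images, text_embedding)
    return nn.functional.l1_loss(fake_feat, real_feat)
```

In such a setup, the feature-matching term would be added with some weight to the usual conditional adversarial loss of the Stage-I generator, so that the initial 64x64 image already carries plausible object shapes and colors before the later refinement stages operate on it.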
Related papers
- Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models [20.19571676239579]
We introduce a novel diffusion-based framework to enhance the alignment of generated images with their corresponding descriptions.
Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image.
We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt.
arXiv Detail & Related papers (2024-06-24T06:12:16Z)
- CoSeR: Bridging Image and Language for Cognitive Super-Resolution [74.24752388179992]
We introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images.
We achieve this by marrying image appearance and language understanding to generate a cognitive embedding.
To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention".
arXiv Detail & Related papers (2023-11-27T16:33:29Z)
- Improving Scene Text Image Super-resolution via Dual Prior Modulation Network [20.687100711699788]
Scene text image super-resolution (STISR) aims to simultaneously increase the resolution and legibility of text images.
Existing approaches neglect the global structure of the text, which bounds the semantic determinism of the scene text.
Our work proposes a plug-and-play module dubbed Dual Prior Modulation Network (DPMN), which leverages dual image-level priors to bring performance gains over existing approaches.
arXiv Detail & Related papers (2023-02-21T02:59:37Z)
- Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) for exploring dependencies of object-to-object, object-to-patch and patch-to-patch.
arXiv Detail & Related papers (2022-06-02T08:34:25Z)
- ViCE: Self-Supervised Visual Concept Embeddings as Contextual and Pixel Appearance Invariant Semantic Representations [77.3590853897664]
This work presents a self-supervised method to learn dense, semantically rich visual embeddings for images, inspired by methods for learning word embeddings in NLP.
arXiv Detail & Related papers (2021-11-24T12:27:30Z)
- Multi-Tailed, Multi-Headed, Spatial Dynamic Memory refined Text-to-Image Synthesis [21.673771194165276]
Current methods synthesize images from text in a multi-stage manner, typically by first generating a rough initial image and then refining image details at subsequent stages.
Our proposed method introduces three novel components to address these shortcomings.
Experimental results demonstrate that our Multi-Headed Spatial Dynamic Memory image refinement with our Multi-Tailed Word-level Initial Generation (MSMT-GAN) performs favourably against the previous state of the art on the CUB and COCO datasets.
arXiv Detail & Related papers (2021-10-15T15:16:58Z)
- DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level.
Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
arXiv Detail & Related papers (2021-08-27T07:20:34Z)
- Learned Spatial Representations for Few-shot Talking-Head Synthesis [68.3787368024951]
We propose a novel approach for few-shot talking-head synthesis.
We show that this disentangled representation leads to a significant improvement over previous methods.
arXiv Detail & Related papers (2021-04-29T17:59:42Z)
- RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network [19.017377597937617]
We study the compositional learning of images and texts for image retrieval.
We introduce a novel method that combines the graph convolutional network (GCN) with existing composition methods.
arXiv Detail & Related papers (2021-04-07T09:41:52Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.