Vision-Language Matching for Text-to-Image Synthesis via Generative
Adversarial Networks
- URL: http://arxiv.org/abs/2208.09596v1
- Date: Sat, 20 Aug 2022 03:34:04 GMT
- Title: Vision-Language Matching for Text-to-Image Synthesis via Generative
Adversarial Networks
- Authors: Qingrong Cheng, Keyu Wen, Xiaodong Gu
- Abstract summary: Text-to-image synthesis aims to generate a photo-realistic and semantically consistent image from a specific text description.
We propose a novel Vision-Language Matching strategy for text-to-image synthesis, named VLMGAN*.
The proposed dual multi-level vision-language matching strategy can be applied to other text-to-image synthesis methods.
- Score: 13.80433764370972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image synthesis aims to generate a photo-realistic and
semantically consistent image from a specific text description. Images
synthesized by off-the-shelf models usually contain fewer components than the
corresponding real image and text description, which degrades both image
quality and textual-visual consistency. To address this issue, we propose a
novel Vision-Language Matching strategy for text-to-image synthesis, named
VLMGAN*, which introduces a dual vision-language matching mechanism to
strengthen image quality and semantic consistency. The dual vision-language
matching mechanism considers textual-visual matching between the generated
image and the corresponding text description, and visual-visual consistency
constraints between the synthesized image and the real image. Given a specific
text description, VLMGAN* first encodes it into textual features and then feeds
them into a dual vision-language matching-based generative model to synthesize
a photo-realistic and textually consistent image. Moreover, the popular
evaluation metrics for text-to-image synthesis are borrowed from plain image
generation and mainly evaluate the realism and diversity of the synthesized
images. We therefore introduce a metric named Vision-Language Matching Score
(VLMS), which evaluates text-to-image synthesis by considering both image
quality and the semantic consistency between the synthesized image and the
description. The proposed dual multi-level vision-language matching strategy
can be applied to other text-to-image synthesis methods. We implement it on two
popular baselines, denoted ${\text{VLMGAN}_{+\text{AttnGAN}}}$ and
${\text{VLMGAN}_{+\text{DFGAN}}}$. Experimental results on two widely used
datasets show that the proposed models achieve significant improvements over
other state-of-the-art methods.
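The abstract does not give the exact form of the matching terms, so the following is only a minimal sketch of the dual vision-language matching idea, assuming PyTorch, pre-extracted image and text embeddings, cosine-similarity matching, and illustrative loss weights (the function and parameter names are hypothetical):

```python
# Minimal sketch of a dual vision-language matching objective.
# Assumptions: PyTorch, cosine-similarity matching, illustrative weights.
import torch
import torch.nn.functional as F

def dual_matching_loss(fake_img_feat, real_img_feat, text_feat,
                       lambda_tv=1.0, lambda_vv=1.0):
    """Combine textual-visual and visual-visual matching terms.

    fake_img_feat: features of the generated image, shape (B, D)
    real_img_feat: features of the ground-truth image, shape (B, D)
    text_feat:     features of the text description, shape (B, D)
    """
    fake = F.normalize(fake_img_feat, dim=-1)
    real = F.normalize(real_img_feat, dim=-1)
    text = F.normalize(text_feat, dim=-1)

    # Textual-visual matching: pull the generated image toward its description.
    loss_tv = (1.0 - (fake * text).sum(dim=-1)).mean()

    # Visual-visual matching: pull the generated image toward the real image.
    loss_vv = (1.0 - (fake * real).sum(dim=-1)).mean()

    return lambda_tv * loss_tv + lambda_vv * loss_vv
```

The paper applies the matching at multiple feature levels; the single-level version above only illustrates the two directions of the constraint (generated image to text, and generated image to real image).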
Related papers
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image
Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
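As a rough illustration of the pseudo-feature alignment described above (not the paper's actual procedure), the sketch below assumes PyTorch, pre-extracted CLIP embeddings for synthetic and real images, a hypothetical linear FeatureAdapter, and a cosine-distance objective:

```python
# Rough sketch of aligning synthetic-image features with real-image features
# in the CLIP embedding space. Assumptions: PyTorch, pre-extracted embeddings,
# a simple linear adapter, cosine-distance objective.
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    """Hypothetical module that nudges synthetic-image CLIP features
    toward the distribution of real-image CLIP features."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, synthetic_feat):
        return F.normalize(self.proj(synthetic_feat), dim=-1)

def alignment_loss(adapter, synthetic_feat, real_feat):
    pseudo = adapter(synthetic_feat)                    # adapted pseudo features
    real = F.normalize(real_feat, dim=-1)
    return (1.0 - (pseudo * real).sum(dim=-1)).mean()   # cosine distance
```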
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis [52.341186561026724]
Lacking compositionality could have severe implications for robustness and fairness.
We introduce a new framework, StyleT2I, to improve the compositionality of text-to-image synthesis.
Results show that StyleT2I outperforms previous approaches in terms of consistency between the input text and synthesized images.
arXiv Detail & Related papers (2022-03-29T17:59:50Z)
- DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level.
Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
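A purely structural sketch of the alternating refinement loop described above, assuming PyTorch and treating AGR and ALR as opaque stand-in modules rather than the paper's implementations:

```python
# Structural sketch only: alternate a global refinement pass with an
# aspect-aware local refinement pass. AGR/ALR are stand-in modules.
import torch.nn as nn

class AspectAwareDynamicRedrawer(nn.Module):
    def __init__(self, agr: nn.Module, alr: nn.Module, num_rounds: int = 2):
        super().__init__()
        self.agr, self.alr, self.num_rounds = agr, alr, num_rounds

    def forward(self, image_feat, sentence_feat, word_feat, aspect_feat):
        for _ in range(self.num_rounds):
            image_feat = self.agr(image_feat, sentence_feat, word_feat)  # global pass
            image_feat = self.alr(image_feat, aspect_feat)               # aspect-level pass
        return image_feat
```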
arXiv Detail & Related papers (2021-08-27T07:20:34Z)
- Improving Text-to-Image Synthesis Using Contrastive Learning [4.850820365312369]
We propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images.
We evaluate our approach over two popular text-to-image synthesis models, AttnGAN and DM-GAN, on datasets CUB and COCO.
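The abstract does not state the exact contrastive objective; a common batch-wise image-text InfoNCE-style loss, given here only as an assumed stand-in (PyTorch, symmetric formulation, illustrative temperature), looks like:

```python
# Batch-wise image-text contrastive (InfoNCE-style) loss.
# Assumptions: PyTorch, symmetric formulation, illustrative temperature.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_feat, txt_feat, temperature=0.1):
    img = F.normalize(img_feat, dim=-1)   # (B, D)
    txt = F.normalize(txt_feat, dim=-1)   # (B, D)
    logits = img @ txt.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Matched image-text pairs sit on the diagonal; treat them as positives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```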
arXiv Detail & Related papers (2021-07-06T06:43:31Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning performs text-image matching by mapping the image and the text into a common embedding space.
Instance-level optimization is used for identity preservation during manipulation.
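A minimal sketch of the common-embedding-space idea behind the visual-linguistic similarity module, assuming PyTorch and simple linear projection heads (TediGAN's actual encoders and training objective are not specified here):

```python
# Project image and text features into a shared space and score their match.
# Assumptions: PyTorch, linear projection heads, cosine similarity.
import torch.nn as nn
import torch.nn.functional as F

class VisualLinguisticSimilarity(nn.Module):
    def __init__(self, img_dim, txt_dim, joint_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, img_feat, txt_feat):
        img = F.normalize(self.img_proj(img_feat), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return (img * txt).sum(dim=-1)  # higher score = better text-image match
```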
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
- Image-to-Image Translation with Text Guidance [139.41321867508722]
The goal of this paper is to embed controllable factors, i.e., natural language descriptions, into image-to-image translation with generative adversarial networks.
We propose several key components: (1) part-of-speech tagging to filter out non-semantic words in the given description, (2) an affine combination module to effectively fuse text and image features from different modalities, and (3) a refined multi-stage architecture to strengthen the differentiation ability of discriminators and the rectification ability of generators.
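A hedged sketch of components (1) and (2), assuming NLTK for part-of-speech tagging and PyTorch for the affine combination; the tag set and module shapes are illustrative, not those of the paper:

```python
# Sketch of POS-based word filtering and a text-conditioned affine combination.
# Assumptions: NLTK (requires punkt and the perceptron tagger data), PyTorch,
# illustrative tag set and layer shapes.
import torch.nn as nn
from nltk import pos_tag, word_tokenize

SEMANTIC_TAGS = {"NN", "NNS", "NNP", "JJ", "JJR", "JJS"}  # nouns / adjectives

def filter_non_semantic(description: str):
    """Keep only words whose POS tag is likely to carry visual semantics."""
    return [w for w, tag in pos_tag(word_tokenize(description))
            if tag in SEMANTIC_TAGS]

class AffineCombination(nn.Module):
    """Fuse text and image features by predicting a per-channel scale and
    shift from the text, then modulating the image feature map."""
    def __init__(self, txt_dim, img_channels):
        super().__init__()
        self.scale = nn.Linear(txt_dim, img_channels)
        self.shift = nn.Linear(txt_dim, img_channels)

    def forward(self, img_feat, txt_feat):  # img_feat: (B, C, H, W)
        w = self.scale(txt_feat).unsqueeze(-1).unsqueeze(-1)
        b = self.shift(txt_feat).unsqueeze(-1).unsqueeze(-1)
        return img_feat * w + b
```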
arXiv Detail & Related papers (2020-02-12T21:09:15Z)