Improving Text-to-Image Synthesis Using Contrastive Learning
- URL: http://arxiv.org/abs/2107.02423v1
- Date: Tue, 6 Jul 2021 06:43:31 GMT
- Title: Improving Text-to-Image Synthesis Using Contrastive Learning
- Authors: Hui Ye, Xiulong Yang, Martin Takac, Rajshekhar Sunderraman, Shihao Ji
- Abstract summary: We propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images.
We evaluate our approach on two popular text-to-image synthesis models, AttnGAN and DM-GAN, using the CUB and COCO datasets.
- Score: 4.850820365312369
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of text-to-image synthesis is to generate a visually realistic image that matches a given text description. In practice, the captions annotated by humans for the same image vary widely in content and choice of words. This linguistic discrepancy between captions of the same image causes the synthesized images to deviate from the ground truth. To address this issue, we propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images. In the pre-training stage, we use contrastive learning to learn consistent textual representations for the captions corresponding to the same image. In the subsequent GAN training stage, we again employ contrastive learning to enhance the consistency between images generated from captions of the same image. We evaluate our approach on two popular text-to-image synthesis models, AttnGAN and DM-GAN, using the CUB and COCO datasets. Experimental results show that our approach effectively improves the quality of synthetic images in terms of three metrics: IS, FID, and R-precision. In particular, on the challenging COCO dataset, our approach improves FID significantly, by 29.60% over AttnGAN and by 21.96% over DM-GAN.
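To make the two-stage idea concrete, the snippet below is a minimal sketch (not the authors' released code) of the caption-level contrastive objective described in the abstract: captions of the same image are treated as positives and pulled together, while captions of different images are pushed apart. The NT-Xent-style formulation, the pairing convention (rows 2k and 2k+1 embed captions of the same image), and the temperature value are illustrative assumptions; the same loss form could likewise be applied to features of images generated from matching captions during GAN training.

```python
# Minimal sketch of a caption-level contrastive loss (illustrative, not the
# authors' implementation). Assumes a batch of 2N caption embeddings in which
# rows 2k and 2k+1 describe the same image; the text encoder producing `z`
# is outside the scope of this sketch.
import torch
import torch.nn.functional as F

def caption_contrastive_loss(z: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z: (2N, d) caption embeddings; rows 2k and 2k+1 are captions of one image."""
    z = F.normalize(z, dim=1)                      # work in cosine-similarity space
    sim = z @ z.t() / temperature                  # (2N, 2N) scaled similarity matrix
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))      # exclude self-similarity

    # Each embedding's positive partner: 0<->1, 2<->3, ... (XOR with 1 flips the pair).
    pos_idx = torch.arange(n, device=z.device) ^ 1
    return F.cross_entropy(sim, pos_idx)           # -log softmax probability of the positive

if __name__ == "__main__":
    z = torch.randn(8, 256)                        # e.g., 4 images x 2 captions, d = 256
    print(caption_contrastive_loss(z).item())
```

Minimizing this loss encourages the two captions of each image to map to nearby embeddings, which is the property the pre-training stage aims for before the GAN stage reuses the same idea on generated images.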
Related papers
- Learning Vision from Models Rivals Learning Vision from Data [54.43596959598465]
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions.
We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption.
We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs.
arXiv Detail & Related papers (2023-12-28T18:59:55Z)
- The Right Losses for the Right Gains: Improving the Semantic Consistency of Deep Text-to-Image Generation with Distribution-Sensitive Losses [0.35898124827270983]
We propose a contrastive learning approach with a novel combination of two loss functions: fake-to-fake loss and fake-to-real loss.
We test this approach on two baseline models: SSAGAN and AttnGAN.
Results show that our approach improves the qualitative results on AttnGAN with style blocks on the CUB dataset.
arXiv Detail & Related papers (2023-12-18T00:05:28Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners [58.941838860425754]
We show that training self-supervised methods on synthetic images can match or beat training on their real-image counterparts.
We develop a multi-positive contrastive learning method, which we call StableRep.
With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP.
arXiv Detail & Related papers (2023-06-01T17:59:51Z)
- Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks [13.80433764370972]
Text-to-image synthesis aims to generate a photo-realistic and semantically consistent image from a specific text description.
We propose a novel Vision-Language Matching strategy for text-to-image synthesis, named VLMGAN*.
The proposed dual multi-level vision-language matching strategy can be applied to other text-to-image synthesis methods.
arXiv Detail & Related papers (2022-08-20T03:34:04Z)
- StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis [52.341186561026724]
A lack of compositionality could have severe implications for robustness and fairness.
We introduce a new framework, StyleT2I, to improve the compositionality of text-to-image synthesis.
Results show that StyleT2I outperforms previous approaches in terms of consistency between the input text and synthesized images.
arXiv Detail & Related papers (2022-03-29T17:59:50Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- Image-to-Image Translation with Text Guidance [139.41321867508722]
The goal of this paper is to embed controllable factors, i.e., natural language descriptions, into image-to-image translation with generative adversarial networks.
Key proposed components include: (1) part-of-speech tagging to filter out non-semantic words in the given description, (2) an affine combination module to effectively fuse text and image features across modalities, and (3) a refined multi-stage architecture that strengthens the differential ability of discriminators and the rectification ability of generators.
arXiv Detail & Related papers (2020-02-12T21:09:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.