Towards Better Text-Image Consistency in Text-to-Image Generation
- URL: http://arxiv.org/abs/2210.15235v1
- Date: Thu, 27 Oct 2022 07:47:47 GMT
- Title: Towards Better Text-Image Consistency in Text-to-Image Generation
- Authors: Zhaorui Tan, Zihan Ye, Xi Yang, Qiufeng Wang, Yuyao Yan, Kaizhu Huang
- Abstract summary: We develop a novel CLIP-based metric termed Semantic Similarity Distance (SSD).
We further design the Parallel Deep Fusion Generative Adversarial Networks (PDF-GAN), which can fuse semantic information at different granularities.
Our PDF-GAN can lead to significantly better text-image consistency while maintaining decent image quality on the CUB and COCO datasets.
- Score: 15.735515302139335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating consistent and high-quality images from given texts is essential
for visual-language understanding. Although impressive results have been
achieved in generating high-quality images, text-image consistency is still a
major concern in existing GAN-based methods. In particular, the most popular
metric, $R$-precision, may not accurately reflect the text-image consistency,
often resulting in very misleading semantics in the generated images. Despite
its significance, the design of a better text-image consistency metric
surprisingly remains under-explored in the community. In this paper, we take a
further step forward and develop a novel CLIP-based metric termed Semantic
Similarity Distance (SSD), which is both theoretically grounded from a
distributional viewpoint and empirically verified on benchmark datasets.
Benefiting from the proposed metric, we further design the Parallel Deep Fusion
Generative Adversarial Networks (PDF-GAN), which can fuse semantic information
at different granularities and capture accurate semantics. Equipped with two
novel plug-and-play components, a Hard-Negative Sentence Constructor and
Semantic Projection, the proposed PDF-GAN can mitigate inconsistent semantics
and bridge
the text-image semantic gap. A series of experiments show that, as opposed to
current state-of-the-art methods, our PDF-GAN can lead to significantly better
text-image consistency while maintaining decent image quality on the CUB and
COCO datasets.
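The abstract does not give SSD's exact formulation beyond its CLIP-based, distributional motivation, so the following Python sketch only illustrates the general recipe: score text-image consistency as a distance between CLIP embeddings, with a toy hard-negative sentence constructor in the spirit of PDF-GAN's component. The function names and the word-swap strategy are illustrative assumptions, not the paper's method.

```python
# Hedged sketch only: SSD's exact distributional formulation is not given in
# the abstract, so a CLIP cosine distance stands in for it here.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_semantic_distance(image_path: str, caption: str) -> float:
    """Cosine distance between CLIP image and text embeddings
    (a stand-in for SSD, not the paper's exact definition)."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img = model.encode_image(image).float()
        txt = model.encode_text(tokens).float()
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return 1.0 - (img @ txt.T).item()

def hard_negative_sentence(caption: str, swaps: dict) -> str:
    """Toy hard-negative constructor: swap attribute words so the sentence
    stays plausible but no longer matches the image (the paper's actual
    Hard-Negative Sentence Constructor may differ)."""
    return " ".join(swaps.get(w, w) for w in caption.split())

# Example: "a red bird with a short beak" -> "a blue bird with a long beak"
neg = hard_negative_sentence("a red bird with a short beak",
                             {"red": "blue", "short": "long"})
```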
Related papers
- Language-Oriented Semantic Latent Representation for Image Transmission [38.62941652189033]
A new paradigm of semantic communication (SC) focuses on delivering the meaning behind bits.
Recent advances in data-to-text models facilitate language-oriented SC.
We propose a novel SC framework that communicates both text and a compressed image embedding.
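The summary does not specify the transmission format; as a minimal sketch, assuming a simple 8-bit quantization and an ad-hoc message layout, a payload carrying both a caption and a compressed image embedding might look like this:

```python
# Hedged sketch: one way to package text plus a compressed image embedding
# for transmission; the int8 quantization and layout are assumptions, not
# the paper's protocol.
from dataclasses import dataclass
import numpy as np

@dataclass
class SemanticMessage:
    caption: str        # language-oriented description of the image
    embedding_q: bytes  # quantized image embedding
    scale: float        # dequantization scale

def pack(caption: str, emb: np.ndarray) -> SemanticMessage:
    scale = (float(np.abs(emb).max()) or 1.0) / 127.0
    q = np.clip(np.round(emb / scale), -127, 127).astype(np.int8)
    return SemanticMessage(caption, q.tobytes(), scale)

def unpack(msg: SemanticMessage) -> np.ndarray:
    q = np.frombuffer(msg.embedding_q, dtype=np.int8)
    return q.astype(np.float32) * msg.scale
```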
arXiv Detail & Related papers (2024-05-16T10:41:31Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
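The summary does not state the exact optimization objective; a minimal sketch, assuming a memory bank of real CLIP features and a nearest-neighbor cosine pull, could look like this:

```python
# Hedged sketch: nudging pseudo (generated-image) CLIP features toward real
# image features. The paper's exact objective is not specified in the
# summary; a simple nearest-feature alignment loss stands in for it.
import torch
import torch.nn.functional as F

def feature_alignment_loss(gen_feats: torch.Tensor,
                           real_feats: torch.Tensor) -> torch.Tensor:
    """gen_feats: (B, D) CLIP features of synthesized images.
    real_feats: (N, D) CLIP features of real images (a memory bank)."""
    gen = F.normalize(gen_feats, dim=-1)
    real = F.normalize(real_feats, dim=-1)
    sims = gen @ real.T               # (B, N) cosine similarities
    closest = sims.max(dim=1).values  # best real match per generated image
    return (1.0 - closest).mean()     # pull generated features toward it
```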
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG).
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training [33.78990448307792]
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language.
Previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words.
In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework.
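As a rough illustration of combining coarse and fine granularity (the weighting and the max-pooled token alignment below are illustrative choices, not the paper's exact model):

```python
# Hedged sketch: a combined coarse + fine similarity for image-text
# retrieval. The alpha weighting and word-to-region max pooling are
# assumptions for illustration.
import torch
import torch.nn.functional as F

def retrieval_score(img_global, txt_global, img_tokens, txt_tokens,
                    alpha: float = 0.5) -> torch.Tensor:
    """img_global/txt_global: (D,) overall features;
    img_tokens: (R, D) region features; txt_tokens: (T, D) word features."""
    coarse = F.cosine_similarity(img_global, txt_global, dim=0)
    # Fine: each word attends to its best-matching image region.
    it = F.normalize(img_tokens, dim=-1)
    tt = F.normalize(txt_tokens, dim=-1)
    fine = (tt @ it.T).max(dim=1).values.mean()
    return alpha * coarse + (1 - alpha) * fine
```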
arXiv Detail & Related papers (2023-06-15T00:19:13Z)
- RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
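A minimal sketch of such a caption reward, assuming BLIP as a stand-in captioner and CLIP text embeddings for scoring agreement (the paper's actual reward may be defined differently):

```python
# Hedged sketch: caption the generated image, then score agreement between
# that caption and the prompt. BLIP and CLIP here are stand-in choices.
import torch
import clip
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
blip_proc = BlipProcessor.from_pretrained(
    "Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)
clip_model, _ = clip.load("ViT-B/32", device=device)

def caption_reward(image: Image.Image, prompt: str) -> float:
    inputs = blip_proc(image, return_tensors="pt").to(device)
    caption = blip_proc.decode(blip.generate(**inputs)[0],
                               skip_special_tokens=True)
    tokens = clip.tokenize([caption, prompt]).to(device)
    with torch.no_grad():
        emb = clip_model.encode_text(tokens).float()
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return (emb[0] @ emb[1]).item()  # higher = caption agrees with prompt
```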
arXiv Detail & Related papers (2023-05-31T06:59:21Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms prior state-of-the-art methods without any post-processing.
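A minimal sketch of text-to-pixel alignment in this spirit, assuming a simple per-pixel similarity supervised by the ground-truth mask (projection heads and the temperature are simplified assumptions):

```python
# Hedged sketch: score each pixel feature against the sentence feature and
# supervise with the referred-region mask; details are simplified.
import torch
import torch.nn.functional as F

def text_to_pixel_loss(pixel_feats: torch.Tensor,
                       text_feat: torch.Tensor,
                       mask: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """pixel_feats: (B, D, H, W); text_feat: (B, D); mask: (B, H, W) in {0,1}."""
    p = F.normalize(pixel_feats, dim=1)
    t = F.normalize(text_feat, dim=1)
    logits = torch.einsum("bdhw,bd->bhw", p, t) / temperature
    # Pixels inside the referred region are positives, the rest negatives.
    return F.binary_cross_entropy_with_logits(logits, mask.float())
```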
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
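A minimal sketch of text-guided latent optimization, assuming CLIP guidance and a generic generator `G`; the loss weights and interfaces are illustrative, not the paper's exact pipeline:

```python
# Hedged sketch: starting from an inverted latent code, nudge it to match
# the caption (via CLIP) while staying close to the original latent.
import torch
import clip

def optimize_latent(G, z_inv: torch.Tensor, caption: str,
                    clip_model, steps: int = 200, lam: float = 0.1):
    """G: generator mapping latents to images; z_inv: inverted latent code."""
    device = z_inv.device
    tokens = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        txt = clip_model.encode_text(tokens).float()
        txt = txt / txt.norm(dim=-1, keepdim=True)
    z = z_inv.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=0.01)
    for _ in range(steps):
        img = G(z)  # assumes output already resized/normalized for CLIP
        feat = clip_model.encode_image(img).float()
        feat = feat / feat.norm(dim=-1, keepdim=True)
        loss = (1 - feat @ txt.T).mean() + lam * (z - z_inv).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```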
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with the text descriptions.
We propose a novel framework, Semantic-Spatial Aware GAN, which is trained end-to-end so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
A visual-linguistic similarity module learns text-image matching by mapping the image and text into a common embedding space.
Instance-level optimization preserves identity during manipulation.
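A minimal sketch of learning such a common embedding space, assuming linear projection heads and a symmetric cross-entropy matching loss (not necessarily TediGAN's exact objective):

```python
# Hedged sketch: project image and text features into a common space and
# train matched pairs to be closer than mismatched ones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceMatcher(nn.Module):
    def __init__(self, img_dim: int, txt_dim: int, common_dim: int = 256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, common_dim)
        self.txt_proj = nn.Linear(txt_dim, common_dim)

    def forward(self, img_feat, txt_feat):
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v @ t.T  # (B, B) similarity matrix

def matching_loss(sim: torch.Tensor) -> torch.Tensor:
    # Matched pairs sit on the diagonal; treat retrieval in both directions
    # as classification (a common choice, assumed here for illustration).
    labels = torch.arange(sim.size(0), device=sim.device)
    return (F.cross_entropy(sim, labels) + F.cross_entropy(sim.T, labels)) / 2
```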
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more effective at synthesizing realistic and text-matching images.
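A minimal sketch of a matching-aware gradient penalty, assuming the penalty is taken on real images paired with their matching sentence embeddings; the exponent `p` and weight `k` below are illustrative defaults rather than the paper's exact values:

```python
# Hedged sketch: penalize the discriminator's gradient on real images paired
# with their matching sentences, so the loss surface is smooth around the
# matched data points.
import torch

def matching_aware_gp(D, real_imgs: torch.Tensor, sent_embs: torch.Tensor,
                      k: float = 2.0, p: float = 6.0) -> torch.Tensor:
    """D(img, sent) -> realness score for (image, sentence) pairs."""
    imgs = real_imgs.detach().requires_grad_(True)
    sents = sent_embs.detach().requires_grad_(True)
    out = D(imgs, sents)
    grads = torch.autograd.grad(outputs=out.sum(),
                                inputs=(imgs, sents),
                                create_graph=True)
    g_img = grads[0].flatten(1).norm(dim=1)
    g_sent = grads[1].flatten(1).norm(dim=1)
    return k * ((g_img ** 2 + g_sent ** 2) ** (p / 2)).mean()
```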
arXiv Detail & Related papers (2020-08-13T12:51:17Z)