Image-to-Image Translation with Text Guidance
- URL: http://arxiv.org/abs/2002.05235v1
- Date: Wed, 12 Feb 2020 21:09:15 GMT
- Title: Image-to-Image Translation with Text Guidance
- Authors: Bowen Li, Xiaojuan Qi, Philip H. S. Torr, Thomas Lukasiewicz
- Abstract summary: The goal of this paper is to embed controllable factors, i.e., natural language descriptions, into image-to-image translation with generative adversarial networks.
We propose four key components: (1) the implementation of part-of-speech tagging to filter out non-semantic words in the given description, (2) the adoption of an affine combination module to effectively fuse text and image features from different modalities, (3) a novel refined multi-stage architecture to strengthen the differential ability of discriminators and the rectification ability of generators, and (4) a new structure loss to further improve the discriminators' ability to distinguish real from synthetic images.
- Score: 139.41321867508722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this paper is to embed controllable factors, i.e., natural
language descriptions, into image-to-image translation with generative
adversarial networks, which allows text descriptions to determine the visual
attributes of synthetic images. We propose four key components: (1) the
implementation of part-of-speech tagging to filter out non-semantic words in
the given description, (2) the adoption of an affine combination module to
effectively fuse different modality text and image features, (3) a novel
refined multi-stage architecture to strengthen the differential ability of
discriminators and the rectification ability of generators, and (4) a new
structure loss to further improve discriminators to better distinguish real and
synthetic images. Extensive experiments on the COCO dataset demonstrate that
our method has a superior performance on both visual realism and semantic
consistency with given descriptions.
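The abstract does not give implementation details for these components. As a rough illustration only, the sketch below shows one way components (1) and (2) might be realized: NLTK part-of-speech tags keep only nouns, adjectives, and verbs, and the affine combination module is approximated by a FiLM-style layer in which the text feature predicts a channel-wise scale and shift for the image feature map. All names, shapes, and the tag whitelist are assumptions, not the paper's actual code.

```python
# Hedged sketch of components (1) and (2); the paper's actual modules may differ.
import nltk                   # needs the 'punkt' and 'averaged_perceptron_tagger' data packages
import torch
import torch.nn as nn

# (1) Part-of-speech filtering: keep only semantically rich words.
KEEP_TAG_PREFIXES = ("NN", "JJ", "VB")   # nouns, adjectives, verbs (assumed whitelist)

def filter_non_semantic(description: str) -> list[str]:
    tokens = nltk.word_tokenize(description)
    tagged = nltk.pos_tag(tokens)
    return [word for word, tag in tagged if tag.startswith(KEEP_TAG_PREFIXES)]

# (2) FiLM-style stand-in for the affine combination module: the text feature
# predicts a per-channel scale and shift that modulate the image feature map.
class AffineCombination(nn.Module):
    def __init__(self, text_dim: int, img_channels: int):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, img_channels)
        self.to_shift = nn.Linear(text_dim, img_channels)

    def forward(self, img_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W); text_feat: (B, text_dim)
        scale = self.to_scale(text_feat)[:, :, None, None]
        shift = self.to_shift(text_feat)[:, :, None, None]
        return img_feat * (1.0 + scale) + shift
```

The FiLM-style form is only one plausible way to fuse modalities affinely; it is chosen here because it keeps the spatial structure of the image features while letting the text control per-channel statistics.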
Related papers
- Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models [68.47333676663312]
We show that a simple modification of classifier-free guidance can help disentangle image factors in text-to-image models.
The key idea of our method, Contrastive Guidance, is to characterize an intended factor with two prompts that differ in minimal tokens.
We illustrate its benefits in three scenarios: (1) to guide domain-specific diffusion models trained on an object class, (2) to gain continuous, rig-like controls for text-to-image generation, and (3) to improve the performance of zero-shot image editors.
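The summary does not spell out the exact guidance rule. A minimal sketch of one plausible reading is shown below: the guidance direction is taken as the difference between the noise predictions for the two minimally different prompts, added on top of the unconditional prediction. The function name, signature, and default weight are illustrative assumptions, not the paper's formulation.

```python
import torch

def contrastive_guidance(eps_prompt_a: torch.Tensor,
                         eps_prompt_b: torch.Tensor,
                         eps_uncond: torch.Tensor,
                         weight: float = 7.5) -> torch.Tensor:
    # Classifier-free guidance normally contrasts a conditional noise prediction
    # with an unconditional one. Here the guidance direction is instead the
    # difference between predictions for two prompts that differ only in the
    # tokens describing the target factor, so the update isolates that factor.
    return eps_uncond + weight * (eps_prompt_a - eps_prompt_b)
```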
arXiv Detail & Related papers (2024-02-21T03:01:17Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks [13.80433764370972]
Text-to-image synthesis aims to generate a photo-realistic and semantic consistent image from a specific text description.
We propose a novel Vision-Language Matching strategy for text-to-image synthesis, named VLMGAN*.
The proposed dual multi-level vision-language matching strategy can be applied to other text-to-image synthesis methods.
arXiv Detail & Related papers (2022-08-20T03:34:04Z)
- StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis [52.341186561026724]
A lack of compositionality could have severe implications for robustness and fairness.
We introduce a new framework, StyleT2I, to improve the compositionality of text-to-image synthesis.
Results show that StyleT2I outperforms previous approaches in terms of consistency between the input text and synthesized images.
arXiv Detail & Related papers (2022-03-29T17:59:50Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
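The summary does not define the text-to-pixel alignment objective. The sketch below is one hedged guess at how such an alignment could be scored: cosine similarity between a sentence embedding and every pixel embedding, supervised so that pixels inside the referred region act as positives. The interface and temperature are assumptions, not CRIS's actual loss.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_alignment_loss(pixel_feats: torch.Tensor,
                                 text_feat: torch.Tensor,
                                 target_mask: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    # pixel_feats: (B, C, H, W) per-pixel embeddings from a vision-language decoder
    # text_feat:   (B, C)       sentence-level embedding of the referring expression
    # target_mask: (B, 1, H, W) binary ground-truth mask of the referred region
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feat = F.normalize(text_feat, dim=1)
    # Cosine similarity between the sentence embedding and every pixel embedding.
    logits = (pixel_feats * text_feat[:, :, None, None]).sum(dim=1, keepdim=True)
    # Pixels inside the referred region are positives, all others negatives.
    return F.binary_cross_entropy_with_logits(logits / temperature, target_mask.float())
```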
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Improving Text-to-Image Synthesis Using Contrastive Learning [4.850820365312369]
We propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images.
We evaluate our approach over two popular text-to-image synthesis models, AttnGAN and DM-GAN, on datasets CUB and COCO.
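The abstract summary does not specify the contrastive objective. As one standard instantiation (an assumption, not necessarily the paper's exact loss), a symmetric InfoNCE loss between matched image and caption embeddings looks like this:

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb: torch.Tensor,
                                txt_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # Symmetric InfoNCE over a batch of matched (image, caption) embeddings:
    # each image is pulled toward its own caption and pushed away from the
    # other captions in the batch, and vice versa.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```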
arXiv Detail & Related papers (2021-07-06T06:43:31Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic and text-matching images.
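The summary names the Matching-Aware Gradient Penalty without defining it. The sketch below shows one common way such a penalty can be computed, on (real image, matching sentence) pairs only; the discriminator interface and the (k, p) hyperparameters are assumptions rather than the paper's exact values.

```python
import torch

def matching_aware_gradient_penalty(discriminator, real_images, sent_embs,
                                    k: float = 2.0, p: float = 6.0) -> torch.Tensor:
    # Gradient penalty computed only on (real image, matching sentence) pairs,
    # smoothing the discriminator around the points it should score as
    # real-and-matching. (k, p) and the discriminator signature are assumptions.
    real_images = real_images.clone().requires_grad_(True)
    sent_embs = sent_embs.clone().requires_grad_(True)
    scores = discriminator(real_images, sent_embs)
    grads = torch.autograd.grad(outputs=scores.sum(),
                                inputs=(real_images, sent_embs),
                                create_graph=True)
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
    return k * (grad_norm ** p).mean()
```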
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.