StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2203.15799v1
- Date: Tue, 29 Mar 2022 17:59:50 GMT
- Title: StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis
- Authors: Zhiheng Li, Martin Renqiang Min, Kai Li, Chenliang Xu
- Abstract summary: Lacking compositionality could have severe implications for robustness and fairness.
We introduce a new framework, StyleT2I, to improve the compositionality of text-to-image synthesis.
Results show that StyleT2I outperforms previous approaches in terms of consistency between the input text and synthesized images.
- Score: 52.341186561026724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although progress has been made for text-to-image synthesis, previous methods
fall short of generalizing to unseen or underrepresented attribute compositions
in the input text. Lacking compositionality could have severe implications for
robustness and fairness, e.g., inability to synthesize the face images of
underrepresented demographic groups. In this paper, we introduce a new
framework, StyleT2I, to improve the compositionality of text-to-image
synthesis. Specifically, we propose a CLIP-guided Contrastive Loss to better
distinguish different compositions among different sentences. To further
improve the compositionality, we design a novel Semantic Matching Loss and a
Spatial Constraint to identify attributes' latent directions for intended
spatial region manipulations, leading to better disentangled latent
representations of attributes. Based on the identified latent directions of
attributes, we propose Compositional Attribute Adjustment to adjust the latent
code, resulting in better compositionality of image synthesis. In addition, we
leverage the $\ell_2$-norm regularization of identified latent directions (norm
penalty) to strike a nice balance between image-text alignment and image
fidelity. In the experiments, we devise a new dataset split and an evaluation
metric to evaluate the compositionality of text-to-image synthesis models. The
results show that StyleT2I outperforms previous approaches in terms of the
consistency between the input text and synthesized images and achieves higher
fidelity.
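As a rough illustration of two of the ideas described above (not the authors' released code), the sketch below shows how a StyleGAN-style latent code could be shifted along identified per-attribute latent directions (Compositional Attribute Adjustment), with an $\ell_2$-norm penalty on those directions to keep the edit small enough to preserve image fidelity. The latent size, number of attributes, `directions` parameter, and `lambda_norm` weight are illustrative assumptions.
```python
# Minimal sketch of Compositional Attribute Adjustment + norm penalty,
# under assumed shapes and names; not the authors' implementation.
import torch

latent_dim = 512                      # assumed StyleGAN-like latent size
num_attributes = 8                    # assumed number of text attributes

# Hypothetical learned latent directions, one per attribute.
directions = torch.nn.Parameter(torch.randn(num_attributes, latent_dim) * 0.01)

def compositional_adjustment(w, attribute_mask, step=1.0):
    """Shift the latent code w along the directions of the attributes
    mentioned in the input text (attribute_mask is a 0/1 vector)."""
    shift = (attribute_mask.unsqueeze(-1) * directions).sum(dim=0)
    return w + step * shift

def norm_penalty(lambda_norm=0.01):
    """l2-norm regularization of the identified directions, trading off
    image-text alignment against fidelity."""
    return lambda_norm * directions.norm(p=2, dim=-1).mean()

# Example usage with a random latent code and two active attributes.
w = torch.randn(latent_dim)
mask = torch.zeros(num_attributes)
mask[0] = mask[3] = 1.0               # e.g. "blond hair" and "smiling"
w_adjusted = compositional_adjustment(w, mask)
loss = norm_penalty()                 # added to the training objective
```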
Related papers
- ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover more exact correspondence of text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network lying on top of existing visual and language encoders using a small dataset.
arXiv Detail & Related papers (2024-09-12T16:46:41Z) - STAR: Scale-wise Text-to-image generation via Auto-Regressive representations [40.66170627483643]
We present STAR, a text-to-image model that employs a scale-wise auto-regressive paradigm.
We show that STAR surpasses existing methods in terms of fidelity, image-text consistency, and aesthetic quality.
arXiv Detail & Related papers (2024-06-16T03:45:45Z) - PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis [62.29033292210752]
Synthesizing high-quality images with consistent semantics and layout remains a challenge.
We propose the adaPtive LAyout-semantiC fusion modulE (PLACE) that harnesses pre-trained models to alleviate the aforementioned issues.
Our approach performs favorably in terms of visual quality, semantic consistency, and layout alignment.
arXiv Detail & Related papers (2024-03-04T09:03:16Z) - T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional
Text-to-image Generation [62.71574695256264]
T2I-CompBench is a comprehensive benchmark for open-world compositional text-to-image generation.
We propose several evaluation metrics specifically designed to evaluate compositional text-to-image generation.
We introduce a new approach, Generative mOdel fine-tuning with Reward-driven Sample selection (GORS) to boost the compositional text-to-image generation abilities.
arXiv Detail & Related papers (2023-07-12T17:59:42Z) - SceneComposer: Any-Level Semantic Image Synthesis [80.55876413285587]
We propose a new framework for conditional image synthesis from semantic layouts of any precision levels.
The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level.
We introduce several novel techniques to address the challenges coming with this new setup.
arXiv Detail & Related papers (2022-11-21T18:59:05Z) - Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks [13.80433764370972]
Text-to-image synthesis aims to generate a photo-realistic and semantically consistent image from a specific text description.
We propose a novel Vision-Language Matching strategy for text-to-image synthesis, named VLMGAN*.
The proposed dual multi-level vision-language matching strategy can be applied to other text-to-image synthesis methods.
arXiv Detail & Related papers (2022-08-20T03:34:04Z) - DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level.
Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
arXiv Detail & Related papers (2021-08-27T07:20:34Z) - Improving Text-to-Image Synthesis Using Contrastive Learning [4.850820365312369]
We propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images.
We evaluate our approach over two popular text-to-image synthesis models, AttnGAN and DM-GAN, on datasets CUB and COCO.
arXiv Detail & Related papers (2021-07-06T06:43:31Z) - Image-to-Image Translation with Text Guidance [139.41321867508722]
The goal of this paper is to embed controllable factors, i.e., natural language descriptions, into image-to-image translation with generative adversarial networks.
We propose several key components, including: (1) part-of-speech tagging to filter out non-semantic words in the given description, (2) an affine combination module to effectively fuse text and image features from different modalities, and (3) a novel refined multi-stage architecture to strengthen the differential ability of discriminators and the rectification ability of generators.
arXiv Detail & Related papers (2020-02-12T21:09:15Z)