Adversarial Text-to-Image Synthesis: A Review
- URL: http://arxiv.org/abs/2101.09983v1
- Date: Mon, 25 Jan 2021 09:58:36 GMT
- Title: Adversarial Text-to-Image Synthesis: A Review
- Authors: Stanislav Frolov, Tobias Hinz, Federico Raue, Jörn Hees, Andreas Dengel
- Abstract summary: We contextualize the state of the art of adversarial text-to-image synthesis models, trace their development since their inception five years ago, and propose a taxonomy based on the level of supervision.
We critically examine current strategies to evaluate text-to-image synthesis models, highlight shortcomings, and identify new areas of research, ranging from the development of better datasets and evaluation metrics to possible improvements in architectural design and model training.
This review complements previous surveys on generative adversarial networks with a focus on text-to-image synthesis, which we believe will help researchers further advance the field.
- Score: 7.593633267653624
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the advent of generative adversarial networks, synthesizing images from
textual descriptions has recently become an active research area. It is a
flexible and intuitive approach to conditional image generation, with significant
progress in recent years in visual realism, diversity, and semantic alignment.
However, the field still faces several challenges that require further research,
such as enabling the generation of high-resolution images with multiple objects
and developing suitable, reliable evaluation metrics that correlate with human
judgement. In this review, we contextualize the state of the art of adversarial
text-to-image synthesis models, trace their development since their inception
five years ago, and propose a taxonomy based on the level of supervision. We
critically examine current strategies to evaluate text-to-image synthesis
models, highlight shortcomings, and identify new areas of research, ranging
from the development of better datasets and evaluation metrics to possible
improvements in architectural design and model training. This review
complements previous surveys on generative adversarial networks with a focus
on text-to-image synthesis, which we believe will help researchers further
advance the field.
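Because reliable evaluation is a central theme of the review, the following minimal Python sketch illustrates the Fréchet Inception Distance (FID), one of the standard automatic metrics discussed in this area. It assumes real_feats and fake_feats are (N, D) NumPy arrays of Inception activations; feature extraction is omitted and all names are illustrative.

import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, fake_feats):
    # Fit a Gaussian to each set of Inception activations.
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    # Matrix square root of the covariance product; numerical error can
    # introduce a small imaginary component, which is discarded.
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    # Frechet distance between the two fitted Gaussians.
    return float(((mu_r - mu_f) ** 2).sum() + np.trace(cov_r + cov_f - 2.0 * covmean))

Lower FID means the generated images are statistically closer to the real ones, although, as the abstract notes, such automatic metrics do not always correlate with human judgement.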
Related papers
- KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities [93.74881034001312]
We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
arXiv Detail & Related papers (2024-10-15T17:50:37Z)
- A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing [8.171572460041823]
Talking head synthesis is an advanced method for generating portrait videos from a still image driven by specific content.
This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driving mechanisms, and editing techniques.
arXiv Detail & Related papers (2024-06-15T08:14:59Z)
- Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, on VisionPrefer to guide the training of text-to-image generative models; its preference prediction accuracy is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z)
- Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis [21.619269792415903]
We present an empirical study introducing a nuanced evaluation framework for text-to-image (T2I) generative models.
Our framework categorizes evaluations into two distinct groups: first, focusing on image qualities such as aesthetics and realism, and second, examining text conditions through concept coverage and fairness.
arXiv Detail & Related papers (2024-03-08T07:41:47Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval [85.03655458677295]
Image-text retrieval has gradually become a major research direction in the field of information retrieval.
We first examine the related concerns and why the focus is on image-text retrieval tasks.
We analyze various aspects of the reproduction of pretrained and non-pretrained retrieval models.
arXiv Detail & Related papers (2022-03-08T05:01:43Z)
- Self-Supervised Image-to-Text and Text-to-Image Synthesis [23.587581181330123]
We propose a novel self-supervised deep learning approach to learning cross-modal embedding spaces.
In our approach, we first obtain dense vector representations of images using a StackGAN-based autoencoder model, and sentence-level dense vector representations using an LSTM-based text autoencoder (see the sketch after this list).
arXiv Detail & Related papers (2021-12-09T13:54:56Z)
- From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has yet to reach a conclusive answer.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models that outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
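As a companion to the self-supervised cross-modal entry above, the following minimal PyTorch sketch illustrates the general idea: modality-specific autoencoders produce dense vectors, and a small mapping network aligns the text embedding space with the image embedding space. All module sizes, names, and the MSE alignment loss are illustrative assumptions, not the cited paper's actual architecture.

import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    # Encoder half of an LSTM text autoencoder: token ids -> sentence vector.
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):
        _, (h, _) = self.lstm(self.embed(tokens))
        return h[-1]  # final hidden state as the sentence embedding

# Stand-in for embeddings from an image autoencoder (e.g. StackGAN-style);
# a pretrained image encoder would supply these in practice.
image_emb = torch.randn(32, 512)

text_enc = TextEncoder(vocab_size=10_000)
tokens = torch.randint(0, 10_000, (32, 20))  # a batch of token ids
text_emb = text_enc(tokens)                  # shape (32, 256)

# Mapping network projecting text embeddings into the image embedding
# space, trained self-supervised with a simple alignment loss.
mapper = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 512))
loss = nn.functional.mse_loss(mapper(text_emb), image_emb)
loss.backward()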