Text-to-Image Cross-Modal Generation: A Systematic Review
- URL: http://arxiv.org/abs/2401.11631v1
- Date: Sun, 21 Jan 2024 23:54:05 GMT
- Title: Text-to-Image Cross-Modal Generation: A Systematic Review
- Authors: Maciej Żelaszczyk, Jacek Mańdziuk
- Abstract summary: We review research on generating visual data from text from the angle of "cross-modal generation."
We provide a breakdown of text-to-image generation into various flavors of image-from-text methods, video-from-text methods, image editing, self-supervised and graph-based approaches.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We review research on generating visual data from text from the angle of
"cross-modal generation." This point of view allows us to draw parallels
between various methods geared towards working on input text and producing
visual output, without limiting the analysis to narrow sub-areas. It also
results in the identification of common templates in the field, which are then
compared and contrasted both within pools of similar methods and across lines
of research. We provide a breakdown of text-to-image generation into various
flavors of image-from-text methods, video-from-text methods, image editing,
self-supervised and graph-based approaches. In this discussion, we focus on
research papers published at 8 leading machine learning conferences in the
years 2016-2022, also incorporating a number of relevant papers not matching
the outlined search criteria. The conducted review suggests a significant
increase in the number of papers published in the area and highlights research
gaps and potential lines of investigation. To our knowledge, this is the first
review to systematically look at text-to-image generation from the perspective
of "cross-modal generation."
Related papers
- Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation [5.55027585813848]
The capability to generate visual text is crucial, attracting both academic interest and a wide range of practical applications.
We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text.
We demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit distance scores.
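To make these metrics concrete, below is a minimal Python sketch of how OCR-based scoring of generated visual text might work: Levenshtein edit distance plus character-level precision/recall/F1 between an OCR engine's output and the intended text. The function names and the character-level granularity are illustrative assumptions, not the benchmark's official implementation.

```python
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between OCR'd text and the intended text."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def ocr_prf(ocr_text: str, target: str):
    """Character-level precision/recall/F1 from multiset overlap."""
    overlap = sum((Counter(ocr_text) & Counter(target)).values())
    p = overlap / len(ocr_text) if ocr_text else 0.0
    r = overlap / len(target) if target else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# e.g. compare an OCR reading of the rendered image to the prompt text
print(edit_distance("Helo World", "Hello World"))  # 1
print(ocr_prf("Helo World", "Hello World"))
```

In practice one would run an OCR engine over the generated image and feed its output into these functions; CLIPScore would additionally require a CLIP model.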
arXiv Detail & Related papers (2024-03-25T04:54:49Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary learning of multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Text-to-image Diffusion Models in Generative AI: A Survey [75.32882187215394]
We present a review of state-of-the-art methods on text-conditioned image synthesis, i.e., text-to-image.
We discuss applications beyond text-to-image generation: text-guided creative generation and text-guided image editing.
arXiv Detail & Related papers (2023-03-14T13:49:54Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map in which each region of interest is annotated with a free-form natural language description.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
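As a rough sketch of what such a spatio-textual representation could look like, the snippet below builds a per-pixel conditioning map from region masks and local descriptions. The toy embed_text function is a stand-in (SpaText derives local embeddings from CLIP), and all names and dimensions are illustrative.

```python
import numpy as np

def embed_text(description: str, dim: int = 16) -> np.ndarray:
    """Stand-in for a real text encoder; hashes characters into a
    fixed-size unit vector purely for illustration."""
    v = np.zeros(dim)
    for i, ch in enumerate(description):
        v[i % dim] += ord(ch)
    return v / (np.linalg.norm(v) + 1e-8)

def spatio_textual_map(masks, descriptions, h=64, w=64, dim=16):
    """Stack per-segment text embeddings into an (H, W, dim) conditioning
    map: each pixel inside a user-drawn segment carries that segment's
    embedding."""
    cond = np.zeros((h, w, dim))
    for mask, desc in zip(masks, descriptions):
        cond[mask] = embed_text(desc, dim)  # broadcast over the region
    return cond

# Example: two rectangular segments with local descriptions.
m1 = np.zeros((64, 64), dtype=bool); m1[10:30, 10:30] = True
m2 = np.zeros((64, 64), dtype=bool); m2[40:60, 20:50] = True
cond = spatio_textual_map([m1, m2], ["a red bird", "a mossy rock"])
print(cond.shape)  # (64, 64, 16)
```

The resulting map would be concatenated channel-wise with the diffusion model's input so that each denoising step sees both the global prompt and the local descriptions.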
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval [85.03655458677295]
Image-text retrieval has gradually become a major research direction in the field of information retrieval.
We first examine the related reproducibility concerns and why the focus is on image-text retrieval tasks.
We analyze various aspects of the reproduction of pretrained and non-pretrained retrieval models.
arXiv Detail & Related papers (2022-03-08T05:01:43Z)
- Self-Supervised Image-to-Text and Text-to-Image Synthesis [23.587581181330123]
We propose a novel self-supervised deep learning approach to learning cross-modal embedding spaces.
In our approach, we first obtain dense vector representations of images using a StackGAN-based autoencoder model, and sentence-level dense vector representations of text using an LSTM-based text autoencoder.
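A minimal PyTorch sketch of this dual-autoencoder idea follows; the small convolutional image autoencoder stands in for the StackGAN-based model, and the alignment loss and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TextAE(nn.Module):
    """LSTM text autoencoder: sentence -> dense vector -> reconstruction."""
    def __init__(self, vocab=1000, emb=64, hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.enc = nn.LSTM(emb, hid, batch_first=True)
        self.dec = nn.LSTM(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, tokens):
        _, (h, _) = self.enc(self.embed(tokens))
        z = h[-1]                                   # sentence embedding
        dec_in = z.unsqueeze(1).expand(-1, tokens.size(1), -1)
        y, _ = self.dec(dec_in)
        return self.out(y), z

class ImageAE(nn.Module):
    """Small conv autoencoder standing in for the StackGAN-based model."""
    def __init__(self, hid=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(16, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Flatten(), nn.Linear(32 * 16 * 16, hid))
        self.dec = nn.Sequential(nn.Linear(hid, 32 * 16 * 16), nn.ReLU(),
                                 nn.Unflatten(1, (32, 16, 16)),
                                 nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(16, 3, 4, 2, 1))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

# Cross-modal alignment: pull paired (image, text) latents together.
img_ae, txt_ae = ImageAE(), TextAE()
images = torch.randn(4, 3, 64, 64)
tokens = torch.randint(0, 1000, (4, 12))
rec_img, z_img = img_ae(images)
logits, z_txt = txt_ae(tokens)
align_loss = nn.functional.mse_loss(z_img, z_txt)  # assumed alignment objective
```

Training would combine the two reconstruction losses with the alignment term so that paired images and sentences land close together in the shared embedding space.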
arXiv Detail & Related papers (2021-12-09T13:54:56Z)
- From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not reached a conclusive answer yet.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval [0.6091702876917279]
We focus on zero-shot image retrieval using sentences as queries and present a survey of the technological trends in this area.
We provide a comprehensive overview of the history of the technology, starting with a discussion of the early studies of image-to-text matching.
A description of the datasets commonly used in experiments and a comparison of the evaluation results of each method are presented.
arXiv Detail & Related papers (2021-05-16T09:43:25Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
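To illustrate the kind of multi-modal graph reasoning described above, here is a small NumPy sketch of one graph-convolution step over a fully connected graph of object and scene-text nodes; the feature dimensions, the fully connected adjacency, and the mean pooling are illustrative assumptions rather than the paper's exact design.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution step: normalized adjacency x features x weights."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)  # ReLU

# Nodes: embeddings of salient objects and of OCR'd scene-text tokens,
# both projected into a common semantic space beforehand (assumed dim 32).
obj_feats = np.random.randn(5, 32)     # 5 detected objects
txt_feats = np.random.randn(3, 32)     # 3 scene-text tokens
H = np.vstack([obj_feats, txt_feats])  # 8 graph nodes in total

A = np.ones((8, 8)) - np.eye(8)        # fully connected object<->text graph
W = np.random.randn(32, 32) * 0.1      # learnable layer weights

H_relational = gcn_layer(H, A, W)        # relationship-enhanced features
image_repr = H_relational.mean(axis=0)   # pooled for classification/retrieval
```

Stacking a few such layers lets object and text nodes exchange information before the pooled representation is used for fine-grained classification or retrieval.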
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.