RenAIssance: A Survey into AI Text-to-Image Generation in the Era of
Large Model
- URL: http://arxiv.org/abs/2309.00810v1
- Date: Sat, 2 Sep 2023 03:27:20 GMT
- Title: RenAIssance: A Survey into AI Text-to-Image Generation in the Era of
Large Model
- Authors: Fengxiang Bie, Yibo Yang, Zhongzhu Zhou, Adam Ghanem, Minjia Zhang,
Zhewei Yao, Xiaoxia Wu, Connor Holmes, Pareesa Golnari, David A. Clifton,
Yuxiong He, Dacheng Tao, Shuaiwen Leon Song
- Abstract summary: Text-to-image generation (TTI) refers to models that process text input and generate high-fidelity images from text descriptions.
Diffusion models are one prominent type of generative model; they generate images by learning to reverse a gradual noising process applied over repeated steps.
In the era of large models, scaling up model size and integrating large language models have further improved the performance of TTI models.
- Score: 93.8067369210696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image generation (TTI) refers to models that process
text input and generate high-fidelity images from text descriptions.
Text-to-image generation using neural networks can be traced back to the
emergence of Generative Adversarial Networks (GANs), followed by
autoregressive Transformers. Diffusion models are one prominent type of
generative model; they generate images by learning to reverse a gradual
noising process applied over repeated steps. Owing to their impressive
results on image synthesis, diffusion models have been cemented as the
major image decoder used by text-to-image models and have brought
text-to-image generation to the forefront of machine-learning (ML)
research. In the era of large models, scaling up model size and
integrating large language models have further improved the performance of
TTI models, yielding generated images that are nearly indistinguishable
from real-world photographs and revolutionizing the way we create and
retrieve images. Our exploratory study suggests that there are further
ways to scale text-to-image models by combining innovative model
architectures with prediction enhancement techniques. We have divided this
survey into five main sections, in which we detail the frameworks of the
major literature in order to examine the different types of text-to-image
generation methods. We then provide a detailed comparison and critique of
these methods and offer possible pathways of improvement for future work.
Looking ahead, we argue that TTI development could yield impressive
productivity gains for content creation, particularly in the AIGC era, and
could be extended to more complex tasks such as video generation and 3D
generation.
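The gradual noising process that diffusion models learn to reverse can be illustrated with the closed-form forward step used in DDPM-style models. The schedule values and the toy four-pixel "image" below are illustrative assumptions, not details from the survey; this is a minimal sketch of the forward (noising) direction only.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    # Linearly increasing per-step noise levels beta_t for t = 0..T-1.
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bar(betas):
    # Cumulative product of (1 - beta_t): the fraction of the original
    # signal variance that survives after t noising steps.
    out, prod = [], 1.0
    for b in betas:
        prod *= (1.0 - b)
        out.append(prod)
    return out

def noise_step(x0, abar_t, rng):
    # Closed-form forward diffusion:
    #   x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, 1)
    return [math.sqrt(abar_t) * v + math.sqrt(1.0 - abar_t) * rng.gauss(0.0, 1.0)
            for v in x0]

T = 1000
abars = alpha_bar(linear_beta_schedule(T))
rng = random.Random(0)
x0 = [0.5, -0.25, 0.75, 0.0]             # toy "image" of 4 pixel values
x_early = noise_step(x0, abars[9], rng)  # early step: mostly signal
x_late = noise_step(x0, abars[-1], rng)  # final step: almost pure noise
```

Generation runs this process in reverse: a learned network iteratively denoises a sample of pure noise, conditioned on the text, until an image emerges.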
Related papers
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; its preference prediction accuracy is comparable to that of human annotators.
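The VP-Score entry uses a reward model at training time; a simpler way to see how a preference score can steer generation is best-of-n reranking. The sketch below uses stand-in functions (the "sampler" and "reward model" here are toy placeholders, not VP-Score or any real model) to show the pattern.

```python
import random

def generate_candidates(prompt, n, rng):
    # Stand-in for a text-to-image sampler: each "image" is just a
    # random float here, purely to illustrate the reranking pattern.
    return [rng.random() for _ in range(n)]

def reward_model(prompt, image):
    # Stand-in for a learned preference score (e.g., human-alignment):
    # here, candidates closer to 0.5 are "preferred" (illustrative only).
    return -abs(image - 0.5)

def best_of_n(prompt, n=8, seed=0):
    # Sample n candidates, keep the one the reward model scores highest.
    rng = random.Random(seed)
    candidates = generate_candidates(prompt, n, rng)
    return max(candidates, key=lambda img: reward_model(prompt, img))
```

Training-time guidance, as in VP-Score, goes further by optimizing the generator itself against the reward, but the reranking pattern above is the cheapest way to apply such a preference model.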
arXiv Detail & Related papers (2024-04-23T14:53:15Z)
- YaART: Yet Another ART Rendering Technology [119.09155882164573]
This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences.
We analyze how these choices affect both the efficiency of the training process and the quality of the generated images.
We demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets.
arXiv Detail & Related papers (2024-04-08T16:51:19Z)
- Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation [12.024554708901514]
We propose LaVi-Bridge, a pipeline that enables the integration of diverse pre-trained language models and generative vision models for text-to-image generation.
Our pipeline is compatible with various language models and generative vision models, accommodating different structures.
arXiv Detail & Related papers (2024-03-12T17:50:11Z)
- Diffusion idea exploration for art generation [0.10152838128195467]
Diffusion models have recently outperformed other generative models on image generation tasks that use cross-modal data as guidance.
The initial experiments for this task of novel image generation demonstrated promising qualitative results.
arXiv Detail & Related papers (2023-07-11T02:35:26Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Textile Pattern Generation Using Diffusion Models [0.0]
This study presents a fine-tuned diffusion model specifically trained for textile pattern generation by text guidance.
The proposed fine-tuned diffusion model outperforms the baseline models in terms of pattern quality and efficiency in textile pattern generation by text guidance.
arXiv Detail & Related papers (2023-04-02T12:12:24Z)
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training a text-to-image generation model on image-only datasets.
It uses a retrieval-then-optimization procedure to synthesize pseudo text features.
It can benefit a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
- PAGER: Progressive Attribute-Guided Extendable Robust Image Generation [38.484332924924914]
This work presents a generative modeling approach based on successive subspace learning (SSL).
Unlike most generative models in the literature, our method does not use neural networks to analyze the underlying source distribution and synthesize images.
The resulting method, called Progressive Attribute-Guided Extendable Robust image generation (PAGER), offers mathematical transparency, progressive content generation, lower training time, robust performance with fewer training samples, and extendibility to conditional image generation.
arXiv Detail & Related papers (2022-06-01T00:35:42Z)
- ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [22.47279425592133]
We propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation.
For the text-to-image generation process, we propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
We train a 10-billion parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs.
arXiv Detail & Related papers (2021-12-31T03:53:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.