Paragraph-to-Image Generation with Information-Enriched Diffusion Model
- URL: http://arxiv.org/abs/2311.14284v2
- Date: Wed, 29 Nov 2023 12:01:35 GMT
- Title: Paragraph-to-Image Generation with Information-Enriched Diffusion Model
- Authors: Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele
Cheng, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang
- Abstract summary: ParaDiffusion is an information-enriched diffusion model for the paragraph-to-image generation task.
It transfers the extensive semantic comprehension capabilities of large language models to the task of image generation.
The code and dataset will be released to foster community research on long-text alignment.
- Score: 67.9265336953134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image (T2I) models have recently experienced rapid development,
achieving astonishing performance in terms of fidelity and textual alignment
capabilities. However, given a long paragraph (up to 512 words), these
generation models still struggle to achieve strong alignment and are unable to
generate images depicting complex scenes. In this paper, we introduce an
information-enriched diffusion model for the paragraph-to-image generation
task, termed ParaDiffusion, which transfers the extensive semantic
comprehension capabilities of large language models to the task of image
generation. At its core, a large language model (e.g., Llama V2) encodes
the long-form text, followed by fine-tuning with LoRA to align the
text-image feature spaces for the generation task. To facilitate the training
of long-text semantic alignment, we also curated a high-quality
paragraph-image pair dataset, namely ParaImage. This dataset combines a small
set of high-quality, meticulously annotated examples with a large-scale
synthetic subset whose long text descriptions are generated by a
vision-language model.
Experiments demonstrate that ParaDiffusion outperforms state-of-the-art models
(SD XL, DeepFloyd IF) on ViLG-300 and ParaPrompts, achieving up to 15% and 45%
human voting rate improvements for visual appeal and text faithfulness,
respectively. The code and dataset will be released to foster community
research on long-text alignment.
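As a concrete illustration of the recipe described above (a decoder-only LLM encoding long-form text, with LoRA adapters fine-tuned toward generation), here is a minimal sketch using Hugging Face transformers and peft. The checkpoint name, adapter rank, and target modules are illustrative assumptions, not the paper's released configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
llm = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)

# LoRA freezes the base LLM and trains only low-rank adapter weights,
# matching the abstract's description of aligning text-image feature spaces.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
llm = get_peft_model(llm, lora_cfg)

def encode_paragraph(text: str) -> torch.Tensor:
    """Encode a long paragraph (capped at 512 tokens here) into per-token features."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    out = llm(**inputs, output_hidden_states=True)
    # The last hidden layer plays the role CLIP/T5 embeddings play in standard
    # T2I models; during fine-tuning, gradients flow only through the adapters.
    return out.hidden_states[-1]  # shape: (1, seq_len, hidden_dim)
```

In a full pipeline, these per-token features would replace the usual CLIP text embeddings as the cross-attention conditioning of the diffusion backbone.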
Related papers
- LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation [30.897935761304034]
We propose a novel framework called LLM4GEN, which enhances the semantic understanding of text-to-image diffusion models.
A specially designed Cross-Adapter Module (CAM) integrates the original text features of text-to-image models with LLM features.
DensePrompts, which contains 7,000 dense prompts, provides a comprehensive evaluation for the text-to-image generation task.
arXiv Detail & Related papers (2024-06-30T15:50:32Z)
- CustomText: Customized Textual Image Generation using Diffusion Models [13.239661107392324]
Textual image generation spans diverse fields like advertising, education, product packaging, social media, information visualization, and branding.
Despite recent strides in language-guided image synthesis using diffusion models, current models excel in image generation but struggle with accurate text rendering and offer limited control over font attributes.
In this paper, we aim to enhance the synthesis of high-quality images with precise text customization, thereby contributing to the advancement of image generation models.
arXiv Detail & Related papers (2024-05-21T06:43:03Z)
- DOCCI: Descriptions of Connected and Contrasting Images [58.377060316967864]
Descriptions of Connected and Contrasting Images (DOCCI) is a dataset with long, human-annotated English descriptions for 15k images.
We instruct human annotators to create comprehensive descriptions for each image.
We show that DOCCI is a useful testbed for text-to-image generation.
arXiv Detail & Related papers (2024-04-30T17:56:24Z)
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training [119.03392147066093]
Recent autoregressive vision-language models have excelled in few-shot text generation tasks but face challenges in alignment tasks.
We introduce a contrastive loss into text generation models, partitioning the language model into dedicated components for unimodal text processing and multimodal data handling.
To bridge this gap, this work introduces VideoDatasetName, an inaugural interleaved video-text dataset featuring comprehensive captions.
arXiv Detail & Related papers (2024-01-01T18:58:42Z)
- GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content (a rendering sketch follows this list).
Our model also achieves significant improvements over recent diffusion-based text generation models.
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [22.47279425592133]
We propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation.
For the text-to-image generation process, we propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
We train a 10-billion parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs.
arXiv Detail & Related papers (2021-12-31T03:53:33Z)
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We present the first method for training text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model (a sketch of this language-free trick follows this list).
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
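GlyphDiffusion's rendering step above ("render the target text as a glyph image") is easy to sketch with Pillow; the canvas size, margins, and default font are illustrative assumptions rather than the paper's setup.

```python
from PIL import Image, ImageDraw, ImageFont

def render_glyph_image(text: str, size=(256, 256)) -> Image.Image:
    """Draw the target text onto a blank canvas, turning text into pixels."""
    canvas = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()  # a real pipeline would fix a specific TTF
    # Naive word wrap so long targets stay inside the canvas.
    lines, line = [], ""
    for word in text.split():
        candidate = (line + " " + word).strip()
        if draw.textlength(candidate, font=font) > size[0] - 20:
            lines.append(line)
            line = word
        else:
            line = candidate
    lines.append(line)
    draw.multiline_text((10, 10), "\n".join(lines), fill="black", font=font)
    return canvas

# The resulting image is what the diffusion model is trained to generate
# when conditioned on the guidance text.
render_glyph_image("The quick brown fox jumps over the lazy dog").save("glyph.png")
```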
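Likewise, the LAFITE entry's language-free trick can be sketched with a pre-trained CLIP: perturbed, normalized image features stand in for the missing text features. The noise scale and exact perturbation below are illustrative assumptions, not the paper's precise scheme.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pseudo_text_feature(image, noise_scale: float = 0.1) -> torch.Tensor:
    """Build a caption-free conditioning vector from the image itself."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feat = clip.get_image_features(**inputs)
    feat = feat / feat.norm(dim=-1, keepdim=True)
    # Because CLIP's image and text embeddings live in one joint space,
    # a perturbed image feature behaves like a plausible text feature.
    noise = torch.randn_like(feat)
    pseudo = feat + noise_scale * noise / noise.norm(dim=-1, keepdim=True)
    return pseudo / pseudo.norm(dim=-1, keepdim=True)
```

A generator trained against such pseudo features can later be driven by real CLIP text features at inference time, which is the sense in which training is "language-free".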