Reason out Your Layout: Evoking the Layout Master from Large Language
Models for Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2311.17126v1
- Date: Tue, 28 Nov 2023 14:51:13 GMT
- Title: Reason out Your Layout: Evoking the Layout Master from Large Language
Models for Text-to-Image Synthesis
- Authors: Xiaohui Chen, Yongfei Liu, Yingxiang Yang, Jianbo Yuan, Quanzeng You,
Li-Ping Liu, Hongxia Yang
- Abstract summary: We introduce a novel approach to improving T2I diffusion models using Large Language Models (LLMs) as layout generators.
Our experiments demonstrate significant improvements in image quality and layout accuracy.
- Score: 47.27044390204868
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in text-to-image (T2I) generative models have shown
remarkable capabilities in producing diverse and imaginative visuals based on
text prompts. Despite these advances, diffusion models sometimes fail to fully
translate the semantic content of the text into images. While conditioning on
layouts has been shown to improve the compositional ability of T2I diffusion
models, such approaches typically require manual layout input. In this work, we
introduce a novel approach to improving T2I
diffusion models using Large Language Models (LLMs) as layout generators. Our
method leverages the Chain-of-Thought prompting of LLMs to interpret text and
generate spatially reasonable object layouts. The generated layout is then used
to enhance the generated images' composition and spatial accuracy. Moreover, we
propose an efficient adapter based on a cross-attention mechanism, which
explicitly integrates the layout information into the stable diffusion models.
Our experiments demonstrate significant improvements in image quality and
layout accuracy, showcasing the potential of LLMs in augmenting generative
image models.
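To make the described pipeline concrete, here is a minimal Python sketch of the first stage, Chain-of-Thought layout generation. The prompt template, the `call_llm` callable, and the 512x512 JSON bounding-box convention are illustrative assumptions, not the paper's exact prompt or interface.

```python
import json
from typing import Callable, Dict, List

# Hypothetical Chain-of-Thought prompt; the paper's actual template is not reproduced here.
LAYOUT_PROMPT = """You are a layout planner for a text-to-image model.
Caption: "{caption}"
Think step by step: list the objects mentioned in the caption, reason about
their relative sizes and spatial relations, then output one bounding box per
object on a 512x512 canvas as JSON:
{{"objects": [{{"name": "...", "box": [x0, y0, x1, y1]}}]}}
Reasoning:"""


def generate_layout(caption: str, call_llm: Callable[[str], str]) -> List[Dict]:
    """Ask an LLM to reason out a layout for `caption` and parse the boxes."""
    response = call_llm(LAYOUT_PROMPT.format(caption=caption))
    # The JSON object is expected to follow the free-form reasoning text.
    start, end = response.find("{"), response.rfind("}") + 1
    layout = json.loads(response[start:end])
    boxes = []
    for obj in layout["objects"]:
        x0, y0, x1, y1 = obj["box"]
        # Clamp to the canvas and drop degenerate boxes.
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(512, x1), min(512, y1)
        if x1 > x0 and y1 > y0:
            boxes.append({"name": obj["name"], "box": [x0, y0, x1, y1]})
    return boxes
```

The second component injects the layout into the denoiser. The sketch below is a rough PyTorch take on a cross-attention adapter: each object becomes a layout token (name embedding plus box embedding), and flattened U-Net feature maps attend to these tokens through a gated residual path. The dimensions, the zero-initialized gate, and the residual wiring are common adapter choices assumed here, not the paper's verified architecture.

```python
import torch
import torch.nn as nn


class LayoutCrossAttentionAdapter(nn.Module):
    """Toy adapter: image features attend to layout tokens (name emb + box emb)."""

    def __init__(self, feat_dim: int = 320, layout_dim: int = 768, heads: int = 8):
        super().__init__()
        self.box_proj = nn.Linear(4, layout_dim)   # embed [x0, y0, x1, y1]
        self.attn = nn.MultiheadAttention(feat_dim, heads,
                                          kdim=layout_dim, vdim=layout_dim,
                                          batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # zero-init: starts as identity

    def forward(self, feats, obj_emb, boxes):
        # feats:   (B, HW, feat_dim)   flattened U-Net feature map
        # obj_emb: (B, N, layout_dim)  text embeddings of object names
        # boxes:   (B, N, 4)           normalized bounding boxes
        layout_tokens = obj_emb + self.box_proj(boxes)
        attended, _ = self.attn(feats, layout_tokens, layout_tokens)
        return feats + self.gate.tanh() * attended  # gated residual injection
```

In practice such an adapter would presumably sit alongside the existing text cross-attention in each U-Net block and be trained while the base Stable Diffusion weights stay frozen, which is consistent with the "efficient adapter" framing in the abstract.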
Related papers
- ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models [52.23899502520261]
We introduce a new framework named ARTIST to focus on the learning of text structures.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
arXiv Detail & Related papers (2024-06-17T19:31:24Z)
- DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models [53.17454737232668]
We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts.
These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions.
We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D.
arXiv Detail & Related papers (2023-12-21T12:11:00Z)
- R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation [74.5598315066249]
We probe into zero-shot grounded T2I generation with diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach.
arXiv Detail & Related papers (2023-10-13T05:48:42Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- DiffUTE: Universal Text Editing Diffusion Model [32.384236053455]
We propose a universal self-supervised text editing diffusion model (DiffUTE).
It aims to replace or modify words in the source image with other words while maintaining a realistic appearance.
Our method achieves an impressive performance and enables controllable editing on in-the-wild images with high fidelity.
arXiv Detail & Related papers (2023-05-18T09:06:01Z)
- LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation [24.694298869398033]
Our method trains efficiently and generates images with both high perceptual quality and layout alignment.
It significantly outperforms 10 other generative models based on GANs, VQ-VAE, and diffusion models.
arXiv Detail & Related papers (2023-02-16T14:20:25Z)
- PRedItOR: Text Guided Image Editing with Diffusion Prior [2.3022070933226217]
Text guided image editing typically requires compute intensive optimization of text embeddings or fine-tuning of the model weights.
Our architecture consists of a diffusion prior model that generates CLIP image embedding conditioned on a text prompt and a custom Latent Diffusion Model trained to generate images conditioned on CLIP image embedding.
We combine this with structure preserving edits on the image decoder using existing approaches such as reverse DDIM to perform text guided image editing.
arXiv Detail & Related papers (2023-02-15T22:58:11Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.