LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image
Generation
- URL: http://arxiv.org/abs/2308.05095v2
- Date: Sat, 12 Aug 2023 05:36:42 GMT
- Title: LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image
Generation
- Authors: Leigang Qu, Shengqiong Wu, Hao Fei, Liqiang Nie, Tat-Seng Chua
- Abstract summary: We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms state-of-the-art models in terms of layout planning and photorealistic image generation.
- Score: 121.45667242282721
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In the text-to-image generation field, recent remarkable progress in Stable
Diffusion makes it possible to generate a rich variety of novel photorealistic
images. However, current models still face misalignment issues (e.g.,
problematic spatial relation understanding and numeration failure) in complex
natural scenes, which impedes high-faithfulness text-to-image generation.
Although recent efforts have been made to improve controllability by giving
fine-grained guidance (e.g., sketch and scribbles), this issue has not been
fundamentally tackled since users have to provide such guidance information
manually. In this work, we strive to synthesize high-fidelity images that are
semantically aligned with a given textual prompt without any guidance. Toward
this end, we propose a coarse-to-fine paradigm to achieve layout planning and
image generation. Concretely, we first generate the coarse-grained layout
conditioned on a given textual prompt via in-context learning based on Large
Language Models. Afterward, we propose a fine-grained object-interaction
diffusion method to synthesize high-faithfulness images conditioned on the
prompt and the automatically generated layout. Extensive experiments
demonstrate that our proposed method outperforms the state-of-the-art models in
terms of layout and image generation. Our code and settings are available at
https://layoutllm-t2i.github.io.
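A minimal sketch of the coarse-to-fine data flow described above, in plain Python: an LLM is prompted with in-context exemplars to plan a coarse layout for the caption, and the caption plus layout are then handed to a layout-conditioned diffusion model. The exemplar format, box convention, and the `call_llm` / `generate_image` hooks are illustrative assumptions, not the authors' actual interface.

```python
# Sketch of the coarse-to-fine paradigm: (1) an LLM plans a coarse layout via
# in-context learning, (2) a layout-conditioned diffusion model renders the
# image. All formats and hooks here are assumptions for illustration only.

from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), normalized to [0, 1]

@dataclass
class LayoutItem:
    label: str
    box: Box

# In-context exemplars pairing captions with object layouts (hypothetical examples).
EXEMPLARS = [
    ("a cat sitting on a sofa",
     [LayoutItem("cat", (0.35, 0.30, 0.30, 0.40)),
      LayoutItem("sofa", (0.10, 0.50, 0.80, 0.45))]),
]

def build_layout_prompt(caption: str) -> str:
    """Serialize exemplars plus the query caption into a single LLM prompt."""
    lines = ["Plan object bounding boxes (label, x, y, w, h) for each caption."]
    for text, layout in EXEMPLARS:
        boxes = "; ".join(f"{it.label}: {it.box}" for it in layout)
        lines.append(f"Caption: {text}\nLayout: {boxes}")
    lines.append(f"Caption: {caption}\nLayout:")
    return "\n\n".join(lines)

def parse_layout(llm_output: str) -> List[LayoutItem]:
    """Parse 'label: (x, y, w, h); ...' back into structured layout items."""
    items = []
    for part in llm_output.split(";"):
        label, _, coords = part.partition(":")
        nums = [float(v) for v in coords.strip(" ()\n").split(",")]
        items.append(LayoutItem(label.strip(), tuple(nums)))  # type: ignore[arg-type]
    return items

def coarse_to_fine(caption: str,
                   call_llm: Callable[[str], str],
                   generate_image: Callable[[str, List[LayoutItem]], object]):
    layout = parse_layout(call_llm(build_layout_prompt(caption)))  # coarse stage
    return generate_image(caption, layout)                         # fine stage

# Toy usage with stubbed models, just to show the data flow.
if __name__ == "__main__":
    fake_llm = lambda prompt: "dog: (0.2, 0.4, 0.3, 0.4); frisbee: (0.6, 0.2, 0.2, 0.2)"
    fake_diffusion = lambda caption, layout: f"<image for '{caption}' with {len(layout)} boxes>"
    print(coarse_to_fine("a dog catching a frisbee", fake_llm, fake_diffusion))
```

Keeping the LLM and the diffusion model behind plain callables mirrors the paper's separation of coarse layout planning from fine-grained, layout-conditioned image synthesis.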
Related papers
- UDiffText: A Unified Framework for High-quality Text Synthesis in
Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also makes significant improvements compared to the recent diffusion model.
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
- GlyphDraw: Seamlessly Rendering Text with Intricate Spatial Structures
in Text-to-Image Generation [18.396131717250793]
We introduce GlyphDraw, a general learning framework aiming to endow image generation models with the capacity to generate images coherently embedded with text for any specific language.
Our method not only produces accurate language characters as in prompts, but also seamlessly blends the generated text into the background.
arXiv Detail & Related papers (2023-03-31T08:06:33Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text
Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space (see the sketch after this entry).
Our method is able to generate high-quality images with complex semantics drawn from both the input texts and the input images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
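The UMM-Diffusion entry above hinges on encoding texts and subject images into one unified multi-modal latent space. The snippet below is a minimal sketch of that idea under assumed dimensions; `UnifiedConditioner` and the encoder widths are illustrative, not the paper's actual architecture.

```python
# Sketch: project text tokens and subject-image features into a shared width
# and concatenate them into one conditioning sequence, as would be consumed by
# a diffusion denoiser's cross-attention. Dimensions are assumptions.

import torch
import torch.nn as nn

TEXT_DIM, IMG_DIM, COND_DIM = 768, 1024, 768  # assumed encoder widths

class UnifiedConditioner(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.text_proj = nn.Linear(TEXT_DIM, COND_DIM)
        self.image_proj = nn.Linear(IMG_DIM, COND_DIM)

    def forward(self, text_tokens: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # (B, Lt, COND_DIM) ++ (B, Li, COND_DIM) -> one multi-modal sequence
        return torch.cat([self.text_proj(text_tokens),
                          self.image_proj(image_feats)], dim=1)

# Toy usage: 77 text tokens and 16 subject-image patches for one sample.
cond = UnifiedConditioner()(torch.randn(1, 77, TEXT_DIM), torch.randn(1, 16, IMG_DIM))
print(cond.shape)  # torch.Size([1, 93, 768]) -> fed to cross-attention layers
```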
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image
Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis (see the sketch after this entry).
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
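The eDiffi entry above attributes the gains to expert denoisers that specialize in different stages of synthesis while keeping inference cost unchanged, since only one expert runs per step. A minimal sketch of such timestep-based routing follows; the two-interval schedule and `Denoiser` signature are assumptions for illustration, not the paper's actual implementation.

```python
# Sketch of timestep-dependent expert routing for an ensemble of denoisers:
# exactly one expert runs per denoising step, so per-step inference cost
# matches a single model. The interval split below is an assumption.

from typing import Callable, List, Tuple

Denoiser = Callable[[object, int, str], object]  # (noisy_latent, t, prompt) -> prediction

class ExpertEnsemble:
    def __init__(self, experts: List[Tuple[range, Denoiser]]):
        # Each expert owns a disjoint range of timesteps (e.g. early steps
        # shape the coarse layout, late steps refine details).
        self.experts = experts

    def __call__(self, noisy_latent, t: int, prompt: str):
        for t_range, expert in self.experts:
            if t in t_range:
                return expert(noisy_latent, t, prompt)
        raise ValueError(f"no expert covers timestep {t}")

# Toy usage with stubbed denoisers (total steps T = 1000 assumed).
coarse_expert = lambda x, t, p: f"coarse-expert({t})"
detail_expert = lambda x, t, p: f"detail-expert({t})"
ensemble = ExpertEnsemble([(range(500, 1000), coarse_expert),
                           (range(0, 500), detail_expert)])
print(ensemble(None, 900, "a castle at sunset"))  # -> coarse-expert(900)
print(ensemble(None, 120, "a castle at sunset"))  # -> detail-expert(120)
```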
- AI Illustrator: Translating Raw Descriptions into Images by Prompt-based
Cross-Modal Generation [61.77946020543875]
We propose a framework for translating raw descriptions with complex semantics into semantically corresponding images.
Our framework consists of two components: a prompt-based projection module from text embeddings to image embeddings, and an adapted image generation module built on StyleGAN (see the sketch after this entry).
Benefiting from the pre-trained models, our method can handle complex descriptions and does not require external paired data for training.
arXiv Detail & Related papers (2022-09-07T13:53:54Z)
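The AI Illustrator entry above couples a prompt-based projection from text embeddings to image embeddings with a StyleGAN-based generator. The sketch below shows that two-module data flow; the dimensions, the `TextToImageProjector` MLP, and the generator stub are all illustrative assumptions rather than the paper's implementation.

```python
# Sketch of the two-component design: a projection module mapping a text
# embedding into an image-embedding space, followed by a StyleGAN-like
# generator. Sizes and the generator stub are assumptions for illustration.

import torch
import torch.nn as nn

TEXT_DIM, IMAGE_DIM, LATENT_DIM = 512, 512, 512  # assumed (CLIP-like) sizes

class TextToImageProjector(nn.Module):
    """MLP that projects a text embedding to the image-embedding space."""
    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(TEXT_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, IMAGE_DIM),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.net(text_emb)

class StyleGANStub(nn.Module):
    """Stand-in for a pre-trained generator adapted to take image embeddings."""
    def __init__(self) -> None:
        super().__init__()
        self.mapping = nn.Linear(IMAGE_DIM, LATENT_DIM)

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        w = self.mapping(image_emb)                         # latent code
        return torch.tanh(w).view(-1, LATENT_DIM, 1, 1)     # placeholder "image"

# Data flow: text embedding -> projected image embedding -> generated image.
projector, generator = TextToImageProjector(), StyleGANStub()
text_emb = torch.randn(1, TEXT_DIM)  # e.g. a CLIP text embedding
image = generator(projector(text_emb))
print(image.shape)
```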
- DT2I: Dense Text-to-Image Generation from Region Descriptions [3.883984493622102]
We introduce dense text-to-image (DT2I) synthesis as a new task to pave the way toward more intuitive image generation.
We also propose DTC-GAN, a novel method to generate images from semantically rich region descriptions.
arXiv Detail & Related papers (2022-04-05T07:57:11Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions.
We propose a novel framework, Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.