LLM Blueprint: Enabling Text-to-Image Generation with Complex and
Detailed Prompts
- URL: http://arxiv.org/abs/2310.10640v2
- Date: Sun, 25 Feb 2024 23:23:00 GMT
- Title: LLM Blueprint: Enabling Text-to-Image Generation with Complex and
Detailed Prompts
- Authors: Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter
Wonka
- Abstract summary: Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
- Score: 60.54912319612113
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Diffusion-based generative models have significantly advanced text-to-image
generation but encounter challenges when processing lengthy and intricate text
prompts describing complex scenes with multiple objects. While excelling in
generating images from short, single-object descriptions, these models often
struggle to faithfully capture all the nuanced details within longer and more
elaborate textual inputs. In response, we present a novel approach leveraging
Large Language Models (LLMs) to extract critical components from text prompts,
including bounding box coordinates for foreground objects, detailed textual
descriptions for individual objects, and a succinct background context. These
components form the foundation of our layout-to-image generation model, which
operates in two phases. The initial Global Scene Generation utilizes object
layouts and background context to create an initial scene but often falls short
in faithfully representing object characteristics as specified in the prompts.
To address this limitation, we introduce an Iterative Refinement Scheme that
iteratively evaluates and refines box-level content to align it with the
corresponding textual descriptions, recomposing objects as needed to ensure
consistency. Our
evaluation on complex prompts featuring multiple objects demonstrates a
substantial improvement in recall compared to baseline diffusion models. This
is further validated by a user study, underscoring the efficacy of our approach
in generating coherent and detailed scenes from intricate textual inputs.
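As a rough illustration of the pipeline described above, the sketch below shows how an LLM-produced blueprint (per-object bounding boxes and descriptions plus a background string) could drive the two-phase generation loop. The JSON schema, the function names (parse_llm_blueprint, generate_global_scene, box_matches_description, refine_box), and the stopping criterion are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two-phase pipeline described in the abstract.
# The JSON blueprint schema and all function bodies are illustrative
# placeholders, not the paper's actual code.
import json
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ObjectSpec:
    description: str                            # detailed per-object text produced by the LLM
    box: Tuple[float, float, float, float]      # (x0, y0, x1, y1), normalized coordinates

def parse_llm_blueprint(llm_output: str) -> Tuple[List[ObjectSpec], str]:
    """Turn the LLM's structured output into object specs plus a background string."""
    data = json.loads(llm_output)
    objects = [ObjectSpec(o["description"], tuple(o["box"])) for o in data["objects"]]
    return objects, data["background"]

def generate_global_scene(objects: List[ObjectSpec], background: str):
    """Phase 1 placeholder: layout-to-image generation of an initial global scene."""
    return {"objects": objects, "background": background, "image": None}

def box_matches_description(scene, obj: ObjectSpec) -> bool:
    """Placeholder check, e.g. a vision-language similarity score over the box crop."""
    return True

def refine_box(scene, obj: ObjectSpec):
    """Phase 2 placeholder: recompose the content inside one bounding box."""
    return scene

def run_pipeline(llm_output: str, max_rounds: int = 3):
    objects, background = parse_llm_blueprint(llm_output)
    scene = generate_global_scene(objects, background)        # Global Scene Generation
    for _ in range(max_rounds):                               # Iterative Refinement Scheme
        mismatched = [o for o in objects if not box_matches_description(scene, o)]
        if not mismatched:
            break
        for obj in mismatched:
            scene = refine_box(scene, obj)
    return scene
```

For instance, a blueprint for "a red vintage car parked beside a tall oak tree" might be parsed from JSON such as {"background": "quiet suburban street at dusk", "objects": [{"description": "red vintage car", "box": [0.1, 0.5, 0.6, 0.9]}, {"description": "tall oak tree", "box": [0.65, 0.1, 0.95, 0.9]}]} — again, a hypothetical schema chosen for the sketch.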
Related papers
- Object-level Visual Prompts for Compositional Image Generation [75.6085388740087]
We introduce a method for composing object-level visual prompts within a text-to-image diffusion model.
A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts.
We introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations.
arXiv Detail & Related papers (2025-01-02T18:59:44Z)
- Nested Attention: Semantic-aware Attention Values for Concept Personalization [78.90196530697897]
We introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers.
Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image.
arXiv Detail & Related papers (2025-01-02T18:52:11Z)
- Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models [20.19571676239579]
We introduce a novel diffusion-based framework to enhance the alignment of generated images with their corresponding descriptions.
Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image.
We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt.
arXiv Detail & Related papers (2024-06-24T06:12:16Z)
- MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [6.4680449907623006]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multiple subjects.
The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving the control of texts.
arXiv Detail & Related papers (2024-06-11T12:32:53Z)
- Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding [26.768147543628096]
We propose a novel framework that emphasizes object and context comprehension inspired by human cognitive processes.
Our method achieves significant performance improvements on three benchmark datasets.
arXiv Detail & Related papers (2024-04-12T16:38:48Z)
- Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models struggle to consistently portray the same subject across diverse prompts.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
- Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation [72.6168579583414]
CompAgent is a training-free approach for compositional text-to-image generation with a large language model (LLM) agent as its core.
Our approach achieves more than 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation.
arXiv Detail & Related papers (2024-01-28T16:18:39Z)
- Progressive Text-to-Image Diffusion with Soft Latent Direction [17.120153452025995]
This paper introduces an innovative progressive synthesis and editing operation that systematically incorporates entities into the target image.
Our proposed framework yields notable advancements in object synthesis, particularly when confronted with intricate and lengthy textual inputs.
arXiv Detail & Related papers (2023-09-18T04:01:25Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.