Related papers: Compositional Text-to-Image Generation with Dense Blob Representations

Compositional Text-to-Image Generation with Dense Blob Representations

URL: http://arxiv.org/abs/2405.08246v1
Date: Tue, 14 May 2024 00:22:06 GMT
Title: Compositional Text-to-Image Generation with Dense Blob Representations
Authors: Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat,
Abstract summary: Existing text-to-image models struggle to follow complex text prompts. We develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. Our experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO.
Score: 48.1976291999674
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives - denoted as dense blob representations - that contain fine-grained details of the scene while being modular, human-interpretable, and easy-to-construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. Particularly, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page: https://blobgen-2d.github.io.

Related papers

UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation [51.31795451147935]
We present a unified generative model that supports visual understanding and visual generation within a single pixel-to-pixel diffusion framework.<n>Our goal is to achieve unification along three axes: the model, the tasks, and the representations.<n> Experiments on text-to-image synthesis and image-to-text understanding demonstrate strong cross-modal alignment.
arXiv Detail & Related papers (2025-11-21T03:02:10Z)
Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions [33.440447854396446]
We train the first open-source text-to-image model on long structured captions.<n>To process long captions efficiently, we propose DimFusion.<n>We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol.
arXiv Detail & Related papers (2025-11-10T09:25:25Z)
BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations [82.94002870060045]
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects. We develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models.
arXiv Detail & Related papers (2025-01-13T19:17:06Z)
LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation [30.897935761304034]
We propose a novel framework called textbfLLM4GEN, which enhances the semantic understanding of text-to-image diffusion models. A specially designed Cross-Adapter Module (CAM) integrates the original text features of text-to-image models with LLM features. DensePrompts, which contains $7,000$ dense prompts, provides a comprehensive evaluation for the text-to-image generation task.
arXiv Detail & Related papers (2024-06-30T15:50:32Z)
Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models [20.19571676239579]
We introduce a novel diffusion-based framework to enhance the alignment of generated images with their corresponding descriptions. Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image. We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt.
arXiv Detail & Related papers (2024-06-24T06:12:16Z)
Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs [77.86214400258473]
We propose a new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG) RPG harnesses the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our framework exhibits wide compatibility with various MLLM architectures.
arXiv Detail & Related papers (2024-01-22T06:16:29Z)
LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts. We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z)
LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation. Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z)
Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences. To be more specific, both input texts and images are encoded into one unified multi-modal latent space. Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.