Dense Text-to-Image Generation with Attention Modulation
- URL: http://arxiv.org/abs/2308.12964v1
- Date: Thu, 24 Aug 2023 17:59:01 GMT
- Title: Dense Text-to-Image Generation with Attention Modulation
- Authors: Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, Jun-Yan Zhu
- Abstract summary: Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions.
We propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions.
We achieve visual results comparable to those of models specifically trained with layout conditions.
- Score: 49.287458275920514
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing text-to-image diffusion models struggle to synthesize realistic
images given dense captions, where each text prompt provides a detailed
description for a specific image region. To address this, we propose
DenseDiffusion, a training-free method that adapts a pre-trained text-to-image
model to handle such dense captions while offering control over the scene
layout. We first analyze the relationship between generated images' layouts and
the pre-trained model's intermediate attention maps. Next, we develop an
attention modulation method that guides objects to appear in specific regions
according to layout guidance. Without requiring additional fine-tuning or
datasets, we improve image generation performance given dense captions in
terms of both automatic and human evaluation scores. In addition, we
achieve visual results comparable to those of models specifically trained
with layout conditions.
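As a rough illustration of the core idea, the sketch below biases a cross-attention layer so that image positions attend more strongly to the text tokens assigned to their region. The additive logit bias, the `strength` parameter, and the per-token binary masks are illustrative assumptions, not the authors' exact modulation rule.

```python
# Minimal sketch of layout-aware cross-attention modulation.
# Assumptions (not from the paper): additive logit bias, flattened
# square latent grid, one binary spatial mask per text token.
import torch
import torch.nn.functional as F


def modulated_cross_attention(q, k, v, token_masks, strength=2.0):
    """
    q: (B, N_img, d)        image-token queries (N_img = H*W of the latent grid)
    k, v: (B, N_txt, d)     text-token keys/values
    token_masks: (B, N_txt, N_img) binary masks; 1 where a text token's
                 phrase should appear, 0 elsewhere.
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-1, -2) / d**0.5            # (B, N_img, N_txt)
    # Push each image position toward the tokens assigned to its region
    # and away from tokens assigned to other regions.
    bias = strength * (2.0 * token_masks.transpose(-1, -2) - 1.0)
    attn = F.softmax(logits + bias, dim=-1)
    return attn @ v                                       # (B, N_img, d)


if __name__ == "__main__":
    B, H, W, N_txt, d = 1, 16, 16, 2, 64
    q = torch.randn(B, H * W, d)
    k = torch.randn(B, N_txt, d)
    v = torch.randn(B, N_txt, d)
    # Toy layout: token 0 owns the left half of the canvas, token 1 the right half.
    grid = torch.zeros(H, W)
    grid[:, : W // 2] = 1
    masks = torch.stack([grid.flatten(), 1 - grid.flatten()]).unsqueeze(0)
    out = modulated_cross_attention(q, k, v, masks)
    print(out.shape)  # torch.Size([1, 256, 64])
```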
Related papers
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
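As a rough sketch of what such a light-weight character-level text encoder could look like (the vocabulary size, width, and depth here are assumptions, not UDiffText's actual configuration):

```python
# Minimal sketch of a light-weight character-level text encoder that could
# stand in for a subword tokenizer when rendering text in images.
import torch
import torch.nn as nn


class CharTextEncoder(nn.Module):
    def __init__(self, vocab_size=128, dim=256, depth=4, max_len=32):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=dim * 4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, text: str) -> torch.Tensor:
        # Encode raw characters (ASCII codepoints) instead of BPE tokens,
        # so each glyph to be rendered gets its own conditioning embedding.
        ids = torch.tensor([[min(ord(c), 127) for c in text]])
        x = self.char_emb(ids) + self.pos_emb[:, : ids.shape[1]]
        return self.encoder(x)        # (1, len(text), dim) conditioning tokens


if __name__ == "__main__":
    enc = CharTextEncoder()
    print(enc("HELLO").shape)  # torch.Size([1, 5, 256])
```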
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model [22.975965453227477]
We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD).
In our experiments, we apply PhD to both subject-driven image editing tasks and explore text-driven scene generation given a reference subject.
arXiv Detail & Related papers (2023-06-13T07:43:10Z)
- RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
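A minimal sketch of the coarse stage's caption-reward idea: score the agreement between the caption produced for a generated image and the original prompt. The sentence-embedding model and the cosine-similarity form are assumptions for illustration, not necessarily the reward used in the paper.

```python
# Sketch of a caption reward: semantic agreement between the caption of the
# generated image and the original prompt. The embedding model name is an
# illustrative choice, not RealignDiff's actual reward model.
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("all-MiniLM-L6-v2")


def caption_reward(prompt: str, generated_caption: str) -> float:
    """Higher reward when the image's caption matches the prompt semantics."""
    emb = _embedder.encode([prompt, generated_caption], convert_to_tensor=True)
    return F.cosine_similarity(emb[0:1], emb[1:2]).item()


if __name__ == "__main__":
    print(caption_reward(
        "a red car parked next to a blue house",
        "a blue house with a red car in front",
    ))
```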
arXiv Detail & Related papers (2023-05-31T06:59:21Z)
- SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis [38.22195812238951]
We propose a novel guidance approach for the sampling process in the diffusion model.
Our approach guides the model with semantic features from CLIP embeddings and enforces geometric constraints.
Our results demonstrate the effectiveness of incorporating bounding box and segmentation map guidance in the diffusion model sampling process.
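The sketch below illustrates the general recipe of gradient-based guidance during sampling: a semantic term and a geometric term are combined into a loss whose gradient nudges the current latent before the next denoising step. The toy loss terms, weights, and update rule are stand-ins, not SceneGenie's actual formulation.

```python
# Schematic sketch of gradient-based guidance at a sampling step: descend a
# combined semantic + geometric loss on the current latent. All terms here
# are toy stand-ins for illustration only.
import torch


def guided_step(latent, denoise_fn, semantic_loss_fn, box_loss_fn, scale=0.1):
    latent = latent.detach().requires_grad_(True)
    pred = denoise_fn(latent)                 # e.g. a predicted clean latent
    loss = semantic_loss_fn(pred) + box_loss_fn(pred)
    grad = torch.autograd.grad(loss, latent)[0]
    # Nudge the latent against the guidance gradient before denoising again.
    return (latent - scale * grad).detach()


if __name__ == "__main__":
    # Toy stand-ins: the "denoiser" is identity, the semantic term pulls the
    # mean toward zero, the box term concentrates energy in one quadrant.
    x = torch.randn(1, 4, 16, 16)
    denoise = lambda z: z
    semantic = lambda p: p.mean().abs()
    box = lambda p: -(p[..., :8, :8] ** 2).mean()
    x = guided_step(x, denoise, semantic, box)
    print(x.shape)
```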
arXiv Detail & Related papers (2023-04-28T00:14:28Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
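A minimal sketch of what such a spatio-textual input could look like as a data structure: a global prompt plus segments that each pair a binary mask with free-form text. The field names are illustrative assumptions, not SpaText's actual interface.

```python
# Sketch of a spatio-textual input: one global prompt plus per-segment
# (mask, description) pairs. Names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List
import numpy as np


@dataclass
class Segment:
    mask: np.ndarray          # (H, W) binary mask of the region
    description: str          # open-vocabulary text for that region


@dataclass
class SpatioTextualInput:
    global_prompt: str
    segments: List[Segment] = field(default_factory=list)


if __name__ == "__main__":
    H, W = 64, 64
    sky = np.zeros((H, W)); sky[: H // 2] = 1
    sea = 1 - sky
    scene = SpatioTextualInput(
        global_prompt="a calm seascape at sunset",
        segments=[Segment(sky, "orange sky with scattered clouds"),
                  Segment(sea, "dark blue sea with gentle waves")],
    )
    print(len(scene.segments), scene.global_prompt)
```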
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- Guiding Attention using Partial-Order Relationships for Image Captioning [2.620091916172863]
A guided attention network mechanism exploits the relationship between the visual scene and text descriptions.
A pairwise ranking objective is used to train this embedding space, so that similar images, topics, and captions lie close together in the shared semantic space.
Experimental results on the MSCOCO dataset show the competitiveness of our approach.
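A minimal sketch of a pairwise ranking objective over a shared embedding space, using in-batch negatives and a hinge margin; the margin value and the bidirectional formulation are assumptions for illustration, not the paper's exact objective.

```python
# Sketch of a pairwise (margin) ranking loss: a matched image-caption pair
# should score higher than mismatched pairs by at least a margin.
import torch
import torch.nn.functional as F


def pairwise_ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (B, d) embeddings of matched image-caption pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    scores = img_emb @ txt_emb.t()                 # (B, B) similarity matrix
    pos = scores.diag().unsqueeze(1)               # matched pairs on the diagonal
    # Hinge in both retrieval directions (image-to-text and text-to-image).
    cost_txt = (margin + scores - pos).clamp(min=0)
    cost_img = (margin + scores - pos.t()).clamp(min=0)
    diag = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_txt.masked_fill(diag, 0).mean() + cost_img.masked_fill(diag, 0).mean()


if __name__ == "__main__":
    loss = pairwise_ranking_loss(torch.randn(8, 128), torch.randn(8, 128))
    print(float(loss))
```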
arXiv Detail & Related papers (2022-04-15T14:22:09Z)
- More Control for Free! Image Synthesis with Semantic Diffusion Guidance [79.88929906247695]
Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from an example image.
We introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both.
We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis.
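A schematic sketch of how unified guidance could compose language and image signals: an optional text embedding and an optional reference-image embedding each contribute a weighted similarity term to the guidance gradient. The encoder stand-ins and weights are illustrative assumptions, not the paper's actual guidance functions.

```python
# Sketch of unified semantic guidance: steer sampling with a language
# embedding, an image embedding, or a weighted mix of both.
import torch
import torch.nn.functional as F


def semantic_guidance_grad(latent, decode_fn, embed_fn,
                           text_emb=None, image_emb=None,
                           w_text=1.0, w_image=1.0):
    latent = latent.detach().requires_grad_(True)
    pred_emb = embed_fn(decode_fn(latent))          # embed the current prediction
    loss = 0.0
    if text_emb is not None:                        # language guidance term
        loss = loss - w_text * F.cosine_similarity(pred_emb, text_emb, dim=-1).mean()
    if image_emb is not None:                       # reference-image guidance term
        loss = loss - w_image * F.cosine_similarity(pred_emb, image_emb, dim=-1).mean()
    return torch.autograd.grad(loss, latent)[0]


if __name__ == "__main__":
    d = 32
    decode = lambda z: z                             # toy decoder
    embed = lambda x: x.flatten(1)[:, :d]            # toy embedder
    z = torch.randn(2, 4, 8, 8)
    g = semantic_guidance_grad(z, decode, embed,
                               text_emb=torch.randn(2, d),
                               image_emb=torch.randn(2, d))
    print(g.shape)
```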
arXiv Detail & Related papers (2021-12-10T18:55:50Z)