Energy-Based Cross Attention for Bayesian Context Update in
Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2306.09869v3
- Date: Sat, 4 Nov 2023 18:18:10 GMT
- Title: Energy-Based Cross Attention for Bayesian Context Update in
Text-to-Image Diffusion Models
- Authors: Geon Yeong Park, Jeongsol Kim, Beomsu Kim, Sang Wan Lee, Jong Chul Ye
- Abstract summary: We present a novel energy-based model (EBM) framework for adaptive context control by modeling the posterior of context vectors.
Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder.
Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts.
- Score: 62.603753097900466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the remarkable performance of text-to-image diffusion models in image
generation tasks, recent studies have raised the issue that generated images
sometimes cannot capture the intended semantic contents of the text prompts,
a phenomenon often called semantic misalignment. To address this, here
we present a novel energy-based model (EBM) framework for adaptive context
control by modeling the posterior of context vectors. Specifically, we first
formulate EBMs of latent image representations and text embeddings in each
cross-attention layer of the denoising autoencoder. Then, we obtain the
gradient of the log posterior of context vectors, which can be updated and
transferred to the subsequent cross-attention layer, thereby implicitly
minimizing a nested hierarchy of energy functions. Our latent EBMs further
allow zero-shot compositional generation as a linear combination of
cross-attention outputs from different contexts. Using extensive experiments,
we demonstrate that the proposed method is highly effective in handling various
image generation tasks, including multi-concept generation, text-guided image
inpainting, and real and synthetic image editing. Code:
https://github.com/EnergyAttention/Energy-Based-CrossAttention.
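The context-update idea in the abstract can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the single logsumexp energy per cross-attention layer, the plain gradient-descent step on the context keys, and the step size are simplifying assumptions made here for clarity, whereas the paper derives the gradient of the log posterior of context vectors and propagates the updated context through subsequent layers. The linear combination of cross-attention outputs mirrors the zero-shot compositional generation described above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    # Latent image queries Q attend over context keys K / values V.
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))
    return A @ V, A

def energy(Q, K):
    # Simplified cross-attention energy (an assumption for this sketch):
    # E(K) = -sum_i log sum_j exp(q_i . k_j / sqrt(d)).
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    m = S.max(axis=1, keepdims=True)
    return -(m.squeeze(1) + np.log(np.exp(S - m).sum(axis=1))).sum()

def context_update(Q, K, step=0.01):
    # One gradient-descent step on E(K). The analytic gradient is
    # dE/dK = -(A^T Q) / sqrt(d), so the update nudges each context
    # vector toward the queries that attend to it.
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))
    return K + step * (A.T @ Q) / np.sqrt(d)

def composed_attention(Q, contexts, weights):
    # Zero-shot composition: linearly combine cross-attention outputs
    # computed under different contexts (K_c, V_c).
    out = np.zeros((Q.shape[0], contexts[0][1].shape[-1]))
    for w, (K, V) in zip(weights, contexts):
        o, _ = cross_attention(Q, K, V)
        out += w * o
    return out
```

In this toy setting a single `context_update` step lowers the energy, which is the sense in which chaining such updates across layers implicitly minimizes a nested hierarchy of energy functions.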
Related papers
- Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models [20.19571676239579]
We introduce a novel diffusion-based framework to enhance the alignment of generated images with their corresponding descriptions.
Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image.
We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt.
arXiv Detail & Related papers (2024-06-24T06:12:16Z) - Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models [36.984151318293726]
We introduce an object-conditioned Energy-Based Attention Map Alignment (EBAMA) method to address the aforementioned problems.
We show that an object-centric attribute binding loss naturally emerges by maximizing the log-likelihood of a $z$-parameterized energy-based model.
Our approach shows great promise in further enhancing the text-controlled image editing ability of diffusion models.
arXiv Detail & Related papers (2024-04-10T23:30:54Z) - Image Inpainting via Tractable Steering of Diffusion Models [54.13818673257381]
This paper proposes to exploit the ability of Tractable Probabilistic Models (TPMs) to exactly and efficiently compute the constrained posterior.
Specifically, this paper adopts a class of expressive TPMs termed Probabilistic Circuits (PCs).
We show that our approach can consistently improve the overall quality and semantic coherence of inpainted images with only 10% additional computational overhead.
arXiv Detail & Related papers (2023-11-28T21:14:02Z) - MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z) - Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z) - Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z) - DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler but more efficient to synthesize realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.