Compositional Visual Generation with Composable Diffusion Models
- URL: http://arxiv.org/abs/2206.01714v1
- Date: Fri, 3 Jun 2022 17:47:04 GMT
- Title: Compositional Visual Generation with Composable Diffusion Models
- Authors: Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, Joshua B. Tenenbaum
- Abstract summary: We propose an alternative structured approach for compositional generation using diffusion models.
An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image.
The proposed method can generate scenes at test time that are substantially more complex than those seen in training.
- Score: 80.75258849913574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large text-guided diffusion models, such as DALLE-2, are able to generate
stunning photorealistic images given natural language descriptions. While such
models are highly flexible, they struggle to understand the composition of
certain concepts, such as confusing the attributes of different objects or
relations between objects. In this paper, we propose an alternative structured
approach for compositional generation using diffusion models. An image is
generated by composing a set of diffusion models, with each of them modeling a
certain component of the image. To do this, we interpret diffusion models as
energy-based models in which the data distributions defined by the energy
functions may be explicitly combined. The proposed method can generate scenes
at test time that are substantially more complex than those seen in training,
composing sentence descriptions, object relations, human facial attributes, and
even generalizing to new combinations that are rarely seen in the real world.
We further illustrate how our approach may be used to compose pre-trained
text-guided diffusion models and generate photorealistic images containing all
the details described in the input descriptions, including the binding of
certain object attributes that have been shown difficult for DALLE-2. These
results point to the effectiveness of the proposed method in promoting
structured generalization for visual generation.
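The composition mechanism described in the abstract can be sketched in a few lines: each diffusion model's noise prediction is treated as the gradient of an energy function, and a conjunction of concepts is sampled by adding each conditional model's offset from the unconditional prediction. The snippet below is a minimal illustrative sketch under that energy-based interpretation; the `denoise_fn` interface and the toy denoiser are assumptions for demonstration, not the paper's actual code.

```python
import numpy as np

def composed_noise(denoise_fn, x, t, conds, weights):
    """Conjunction of several conditional diffusion models.

    Interpreting each noise prediction eps(x, t | c_i) as an energy
    gradient, summing the weighted offsets from the unconditional
    prediction approximates sampling from the product of the
    individual conditional distributions.
    """
    eps_uncond = denoise_fn(x, t, None)  # unconditional prediction
    eps = eps_uncond.copy()
    for c, w in zip(conds, weights):
        # add the weighted score offset contributed by concept c
        eps += w * (denoise_fn(x, t, c) - eps_uncond)
    return eps

# Toy denoiser for illustration: pulls x toward a condition-dependent mean.
def toy_denoiser(x, t, cond):
    mean = 0.0 if cond is None else float(cond)
    return x - mean

x = np.array([1.0, 2.0])
eps = composed_noise(toy_denoiser, x, t=0, conds=[1.0, 3.0], weights=[0.5, 0.5])
```

In a real sampler, `eps` would replace the single model's noise prediction at every reverse-diffusion step, so each concept steers the sample throughout denoising.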
Related papers
- Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning [40.06403155373455]
We propose a novel reinforcement learning framework for personalized text-to-image generation.
Our proposed approach outperforms existing state-of-the-art methods by a large margin in visual fidelity while maintaining text alignment.
arXiv Detail & Related papers (2024-07-09T08:11:53Z) - FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior [50.0535198082903]
We offer a novel approach to image composition, which integrates multiple input images into a single, coherent image.
We showcase the potential of utilizing the powerful generative prior inherent in large-scale pre-trained diffusion models to accomplish generic image composition.
arXiv Detail & Related papers (2024-07-06T03:35:43Z) - DiffusionPID: Interpreting Diffusion via Partial Information Decomposition [24.83767778658948]
We apply information-theoretic principles to decompose the input text prompt into its elementary components.
We analyze how individual tokens and their interactions shape the generated image.
We show that PID is a potent tool for evaluating and diagnosing text-to-image diffusion models.
arXiv Detail & Related papers (2024-06-07T18:17:17Z) - Diffexplainer: Towards Cross-modal Global Explanations with Diffusion Models [51.21351775178525]
DiffExplainer is a novel framework that leverages language-vision models to enable multimodal global explainability.
It employs diffusion models conditioned on optimized text prompts, synthesizing images that maximize class outputs.
The analysis of generated visual descriptions allows for automatic identification of biases and spurious features.
arXiv Detail & Related papers (2024-04-03T10:11:22Z) - A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data [55.748186000425996]
Recent advancements show that diffusion models can generate high-quality images.
We study this phenomenon in a hierarchical generative model of data.
Our analysis characterises the relationship between time and scale in diffusion models.
arXiv Detail & Related papers (2024-02-26T19:52:33Z) - RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models [42.20230095700904]
RealCompo is a new training-free and transfer-friendly text-to-image generation framework.
An intuitive and novel balancer is proposed to balance the strengths of the two models in the denoising process.
Our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models.
arXiv Detail & Related papers (2024-02-20T10:56:52Z) - CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models [48.10798436003449]
Images produced by text-to-image diffusion models might not always faithfully represent the semantic intent of the provided text prompt.
Our work introduces a novel perspective by tackling this challenge in a contrastive context.
We conduct extensive experiments across a wide variety of scenarios, each involving unique combinations of objects, attributes, and scenes.
arXiv Detail & Related papers (2023-12-11T01:42:15Z) - Object-Centric Relational Representations for Image Generation [18.069747511100132]
This paper explores a novel method to condition image generation, based on object-centric relational representations.
We show that such architectural biases entail properties that facilitate the manipulation and conditioning of the generative process.
We also propose a novel benchmark for image generation consisting of a synthetic dataset of images paired with their relational representation.
arXiv Detail & Related papers (2023-03-26T11:17:17Z) - Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC [106.06185677214353]
Diffusion models have quickly become the prevailing approach to generative modeling in many domains.
We propose an energy-based parameterization of diffusion models which enables the use of new compositional operators.
We find these samplers lead to notable improvements in compositional generation across a wide set of problems.
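The "new compositional operators" mentioned above go beyond conjunction; under the energy-based view, a concept can also be negated by subtracting its score offset. The following is an illustrative sketch only, assuming the same generic `denoise_fn(x, t, cond)` interface as before; the function names and toy denoiser are hypothetical, not from the paper.

```python
import numpy as np

def negated_noise(denoise_fn, x, t, cond_keep, cond_drop, w_keep=1.0, w_drop=1.0):
    """Sketch of a NOT operator under the energy-based view of diffusion.

    Adding the kept concept's score offset and subtracting the dropped
    concept's offset steers sampling toward cond_keep and away from
    cond_drop.
    """
    eps_u = denoise_fn(x, t, None)                       # unconditional
    eps = eps_u + w_keep * (denoise_fn(x, t, cond_keep) - eps_u)
    eps -= w_drop * (denoise_fn(x, t, cond_drop) - eps_u)
    return eps

# Toy denoiser for illustration: pulls x toward a condition-dependent mean.
def toy_denoiser(x, t, cond):
    mean = 0.0 if cond is None else float(cond)
    return x - mean

x = np.array([0.0])
eps = negated_noise(toy_denoiser, x, 0, cond_keep=2.0, cond_drop=-1.0)
```

Because a subtracted offset can push samples off the data manifold, works like the one above pair such operators with MCMC-style corrector steps rather than plain ancestral sampling.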
arXiv Detail & Related papers (2023-02-22T18:48:46Z) - Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.