Compositional Visual Generation with Composable Diffusion Models
- URL: http://arxiv.org/abs/2206.01714v1
- Date: Fri, 3 Jun 2022 17:47:04 GMT
- Title: Compositional Visual Generation with Composable Diffusion Models
- Authors: Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, Joshua B. Tenenbaum
- Abstract summary: We propose an alternative structured approach for compositional generation using diffusion models.
An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image.
The proposed method can generate scenes at test time that are substantially more complex than those seen in training.
- Score: 80.75258849913574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large text-guided diffusion models, such as DALLE-2, are able to generate
stunning photorealistic images given natural language descriptions. While such
models are highly flexible, they struggle to understand the composition of
certain concepts, such as confusing the attributes of different objects or
relations between objects. In this paper, we propose an alternative structured
approach for compositional generation using diffusion models. An image is
generated by composing a set of diffusion models, with each of them modeling a
certain component of the image. To do this, we interpret diffusion models as
energy-based models in which the data distributions defined by the energy
functions may be explicitly combined. The proposed method can generate scenes
at test time that are substantially more complex than those seen in training,
composing sentence descriptions, object relations, human facial attributes, and
even generalizing to new combinations that are rarely seen in the real world.
We further illustrate how our approach may be used to compose pre-trained
text-guided diffusion models and generate photorealistic images containing all
the details described in the input descriptions, including the binding of
certain object attributes that have been shown difficult for DALLE-2. These
results point to the effectiveness of the proposed method in promoting
structured generalization for visual generation.
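The abstract's core mechanism, treating each conditioned diffusion model as an energy-based model and combining their distributions, amounts at sampling time to mixing the models' noise predictions. The following is a minimal NumPy sketch of that conjunction-style composition under stated assumptions: `toy_denoiser`, the concept prompts, the guidance weights, and the crude update rule are all hypothetical stand-ins for a trained text-conditioned noise predictor and a proper DDPM/DDIM sampler, not the paper's released implementation.

```python
# Sketch of composing several conditioned noise estimates into one update.
# Everything below is illustrative; a real system would use a trained network.
import numpy as np

def toy_denoiser(x, t, cond=None):
    """Hypothetical noise predictor eps_theta(x, t | cond); stands in for a U-Net."""
    rng = np.random.default_rng(0 if cond is None else hash(cond) % 2**32)
    return 0.1 * x + 0.01 * rng.standard_normal(x.shape)

def composed_eps(x, t, concepts, weights):
    """Conjunction: eps_uncond + sum_i w_i * (eps(x, t | c_i) - eps_uncond)."""
    eps_uncond = toy_denoiser(x, t, cond=None)
    eps = eps_uncond.copy()
    for c, w in zip(concepts, weights):
        eps += w * (toy_denoiser(x, t, cond=c) - eps_uncond)
    return eps

def sample(shape=(3, 64, 64), steps=50,
           concepts=("a red car", "a snowy street"), weights=(7.5, 7.5)):
    """Crude sampling loop driven by the composed noise estimate."""
    x = np.random.standard_normal(shape)
    for t in reversed(range(steps)):
        eps = composed_eps(x, t, concepts, weights)
        x = x - eps / steps            # placeholder update; a real sampler
        if t > 0:                      # uses the schedule coefficients for step t
            x += 0.01 * np.random.standard_normal(shape)
    return x

if __name__ == "__main__":
    print(sample().shape)  # (3, 64, 64)
```

The point of the sketch is the `composed_eps` step: each concept contributes a separate conditioned estimate, so the set of concepts can be changed at test time without retraining, which is what allows scenes more complex than those seen in training.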
Related papers
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- Progressive Compositionality In Text-to-Image Generative Models [33.18510121342558]
We propose EvoGen, a new curriculum for contrastive learning of diffusion models.
In this work, we leverage large language models (LLMs) to compose realistic, complex scenarios.
We also harness Visual-Question Answering (VQA) systems alongside diffusion models to automatically curate a contrastive dataset, ConPair.
arXiv Detail & Related papers (2024-10-22T05:59:29Z)
- IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation [70.8833857249951]
IterComp is a novel framework that aggregates composition-aware model preferences from multiple models.
We propose an iterative feedback learning method to enhance compositionality in a closed-loop manner.
IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation.
arXiv Detail & Related papers (2024-10-09T17:59:13Z)
- How Diffusion Models Learn to Factorize and Compose [14.161975556325796]
Diffusion models are capable of generating photo-realistic images that combine elements which likely do not appear together in the training set.
We investigate whether and when diffusion models learn semantically meaningful and factorized representations of composable features.
arXiv Detail & Related papers (2024-08-23T17:59:03Z)
- DiffusionPID: Interpreting Diffusion via Partial Information Decomposition [24.83767778658948]
We apply information-theoretic principles to decompose the input text prompt into its elementary components.
We analyze how individual tokens and their interactions shape the generated image.
We show that PID is a potent tool for evaluating and diagnosing text-to-image diffusion models.
arXiv Detail & Related papers (2024-06-07T18:17:17Z)
- Diffexplainer: Towards Cross-modal Global Explanations with Diffusion Models [51.21351775178525]
DiffExplainer is a novel framework that, leveraging language-vision models, enables multimodal global explainability.
It employs diffusion models conditioned on optimized text prompts, synthesizing images that maximize class outputs.
The analysis of generated visual descriptions allows for automatic identification of biases and spurious features.
arXiv Detail & Related papers (2024-04-03T10:11:22Z)
- A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data [55.748186000425996]
Recent advancements show that diffusion models can generate high-quality images.
We study this phenomenon in a hierarchical generative model of data.
Our analysis characterises the relationship between time and scale in diffusion models.
arXiv Detail & Related papers (2024-02-26T19:52:33Z)
- Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.