Compositional Visual Generation with Composable Diffusion Models
- URL: http://arxiv.org/abs/2206.01714v1
- Date: Fri, 3 Jun 2022 17:47:04 GMT
- Title: Compositional Visual Generation with Composable Diffusion Models
- Authors: Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, Joshua B. Tenenbaum
- Abstract summary: We propose an alternative structured approach for compositional generation using diffusion models.
An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image.
The proposed method can generate scenes at test time that are substantially more complex than those seen in training.
- Score: 80.75258849913574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large text-guided diffusion models, such as DALLE-2, are able to generate
stunning photorealistic images given natural language descriptions. While such
models are highly flexible, they struggle to understand the composition of
certain concepts, such as confusing the attributes of different objects or
relations between objects. In this paper, we propose an alternative structured
approach for compositional generation using diffusion models. An image is
generated by composing a set of diffusion models, with each of them modeling a
certain component of the image. To do this, we interpret diffusion models as
energy-based models in which the data distributions defined by the energy
functions may be explicitly combined. The proposed method can generate scenes
at test time that are substantially more complex than those seen in training,
composing sentence descriptions, object relations, human facial attributes, and
even generalizing to new combinations that are rarely seen in the real world.
We further illustrate how our approach may be used to compose pre-trained
text-guided diffusion models and generate photorealistic images containing all
the details described in the input descriptions, including the binding of
certain object attributes that have been shown difficult for DALLE-2. These
results point to the effectiveness of the proposed method in promoting
structured generalization for visual generation.
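The composition mechanism described in the abstract can be sketched in a few lines: each diffusion model's noise prediction is treated as the gradient of an energy function, and a conjunction of concepts is sampled by adding each conditional model's offset from the unconditional prediction. The snippet below is a minimal illustrative sketch under that energy-based interpretation; the `denoise_fn` interface and the toy denoiser are assumptions for demonstration, not the paper's actual code.

```python
import numpy as np

def composed_noise(denoise_fn, x, t, conds, weights):
    """Conjunction of several conditional diffusion models.

    Interpreting each noise prediction eps(x, t | c_i) as an energy
    gradient, summing the weighted offsets from the unconditional
    prediction approximates sampling from the product of the
    individual conditional distributions.
    """
    eps_uncond = denoise_fn(x, t, None)  # unconditional prediction
    eps = eps_uncond.copy()
    for c, w in zip(conds, weights):
        # add the weighted score offset contributed by concept c
        eps += w * (denoise_fn(x, t, c) - eps_uncond)
    return eps

# Toy denoiser for illustration: pulls x toward a condition-dependent mean.
def toy_denoiser(x, t, cond):
    mean = 0.0 if cond is None else float(cond)
    return x - mean

x = np.array([1.0, 2.0])
eps = composed_noise(toy_denoiser, x, t=0, conds=[1.0, 3.0], weights=[0.5, 0.5])
```

In a real sampler, `eps` would replace the single model's noise prediction at every reverse-diffusion step, so each concept steers the sample throughout denoising.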
Related papers
- Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning [40.06403155373455]
We propose a novel reinforcement learning framework for personalized text-to-image generation.
Our proposed approach outperforms existing state-of-the-art methods by a large margin in visual fidelity while maintaining text alignment.
arXiv Detail & Related papers (2024-07-09T08:11:53Z) - FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior [50.0535198082903]
We offer a novel approach to image composition, which integrates multiple input images into a single, coherent image.
We showcase the potential of utilizing the powerful generative prior inherent in large-scale pre-trained diffusion models to accomplish generic image composition.
arXiv Detail & Related papers (2024-07-06T03:35:43Z) - DiffusionPID: Interpreting Diffusion via Partial Information Decomposition [24.83767778658948]
We apply information-theoretic principles to decompose the input text prompt into its elementary components.
We analyze how individual tokens and their interactions shape the generated image.
We show that PID is a potent tool for evaluating and diagnosing text-to-image diffusion models.
arXiv Detail & Related papers (2024-06-07T18:17:17Z) - Diffexplainer: Towards Cross-modal Global Explanations with Diffusion Models [51.21351775178525]
DiffExplainer is a novel framework that leverages language-vision models to enable multimodal global explainability.
It employs diffusion models conditioned on optimized text prompts, synthesizing images that maximize class outputs.
The analysis of generated visual descriptions allows for automatic identification of biases and spurious features.
arXiv Detail & Related papers (2024-04-03T10:11:22Z) - A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data [55.748186000425996]
Recent advancements show that diffusion models can generate high-quality images.
We study this phenomenon in a hierarchical generative model of data.
Our analysis characterises the relationship between time and scale in diffusion models.
arXiv Detail & Related papers (2024-02-26T19:52:33Z) - RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models [42.20230095700904]
RealCompo is a new training-free and transfer-friendly text-to-image generation framework.
An intuitive and novel balancer is proposed to balance the strengths of the two models in the denoising process.
Our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models.
arXiv Detail & Related papers (2024-02-20T10:56:52Z) - CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models [48.10798436003449]
Images produced by text-to-image diffusion models might not always faithfully represent the semantic intent of the provided text prompt.
Our work introduces a novel perspective by tackling this challenge in a contrastive context.
We conduct extensive experiments across a wide variety of scenarios, each involving unique combinations of objects, attributes, and scenes.
arXiv Detail & Related papers (2023-12-11T01:42:15Z) - Object-Centric Relational Representations for Image Generation [18.069747511100132]
This paper explores a novel method to condition image generation, based on object-centric relational representations.
We show that such architectural biases entail properties that facilitate the manipulation and conditioning of the generative process.
We also propose a novel benchmark for image generation consisting of a synthetic dataset of images paired with their relational representation.
arXiv Detail & Related papers (2023-03-26T11:17:17Z) - Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC [106.06185677214353]
Diffusion models have quickly become the prevailing approach to generative modeling in many domains.
We propose an energy-based parameterization of diffusion models which enables the use of new compositional operators.
We find these samplers lead to notable improvements in compositional generation across a wide set of problems.
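The "new compositional operators" mentioned above go beyond conjunction; under the energy-based view, a concept can also be negated by subtracting its score offset. The following is an illustrative sketch only, assuming the same generic `denoise_fn(x, t, cond)` interface as before; the function names and toy denoiser are hypothetical, not from the paper.

```python
import numpy as np

def negated_noise(denoise_fn, x, t, cond_keep, cond_drop, w_keep=1.0, w_drop=1.0):
    """Sketch of a NOT operator under the energy-based view of diffusion.

    Adding the kept concept's score offset and subtracting the dropped
    concept's offset steers sampling toward cond_keep and away from
    cond_drop.
    """
    eps_u = denoise_fn(x, t, None)                       # unconditional
    eps = eps_u + w_keep * (denoise_fn(x, t, cond_keep) - eps_u)
    eps -= w_drop * (denoise_fn(x, t, cond_drop) - eps_u)
    return eps

# Toy denoiser for illustration: pulls x toward a condition-dependent mean.
def toy_denoiser(x, t, cond):
    mean = 0.0 if cond is None else float(cond)
    return x - mean

x = np.array([0.0])
eps = negated_noise(toy_denoiser, x, 0, cond_keep=2.0, cond_drop=-1.0)
```

Because a subtracted offset can push samples off the data manifold, works like the one above pair such operators with MCMC-style corrector steps rather than plain ancestral sampling.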
arXiv Detail & Related papers (2023-02-22T18:48:46Z) - Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.