Progressive Compositionality In Text-to-Image Generative Models
- URL: http://arxiv.org/abs/2410.16719v1
- Date: Tue, 22 Oct 2024 05:59:29 GMT
- Title: Progressive Compositionality In Text-to-Image Generative Models
- Authors: Xu Han, Linghao Jin, Xiaofeng Liu, Paul Pu Liang
- Abstract summary: We propose EvoGen, a new curriculum for contrastive learning of diffusion models.
In this work, we leverage large-language models (LLMs) to compose realistic, complex scenarios.
We also harness Visual-Question Answering (VQA) systems alongside diffusion models to automatically curate a contrastive dataset, ConPair.
- Score: 33.18510121342558
- Abstract: Despite the impressive text-to-image (T2I) synthesis capabilities of diffusion models, they often struggle to understand compositional relationships between objects and attributes, especially in complex settings. Existing solutions have tackled these challenges by optimizing the cross-attention mechanism or learning from the caption pairs with minimal semantic changes. However, can we generate high-quality complex contrastive images that diffusion models can directly discriminate based on visual representations? In this work, we leverage large-language models (LLMs) to compose realistic, complex scenarios and harness Visual-Question Answering (VQA) systems alongside diffusion models to automatically curate a contrastive dataset, ConPair, consisting of 15k pairs of high-quality contrastive images. These pairs feature minimal visual discrepancies and cover a wide range of attribute categories, especially complex and natural scenarios. To learn effectively from these error cases, i.e., hard negative images, we propose EvoGen, a new multi-stage curriculum for contrastive learning of diffusion models. Through extensive experiments across a wide range of compositional scenarios, we showcase the effectiveness of our proposed framework on compositional T2I benchmarks.
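The curation pipeline described in the abstract (LLM-composed captions, diffusion-rendered candidates, VQA-based filtering) can be pictured as a simple loop. The sketch below is illustrative only: `llm.minimal_edit`, `t2i.generate`, and `vqa.score` are assumed placeholder interfaces rather than the paper's released code, and the 0.9 faithfulness threshold is an arbitrary example value.

```python
# Hypothetical sketch of a ConPair-style curation loop: an LLM proposes a
# minimally edited "hard negative" caption, a text-to-image model renders
# both captions, and a VQA model keeps only faithful pairs. All helper
# names are illustrative placeholders, not the authors' released API.
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ContrastivePair:
    prompt: str            # original compositional caption
    negative_prompt: str   # minimally edited caption (e.g., swapped attribute binding)
    positive_image: Any
    negative_image: Any


def curate_pairs(llm, t2i, vqa, seed_prompts: List[str],
                 threshold: float = 0.9) -> List[ContrastivePair]:
    pairs: List[ContrastivePair] = []
    for prompt in seed_prompts:
        # 1. Ask the LLM for a caption with a minimal semantic change,
        #    e.g. "a red cube on a blue ball" -> "a blue cube on a red ball".
        negative_prompt = llm.minimal_edit(prompt)

        # 2. Render both captions with the diffusion model.
        pos_img = t2i.generate(prompt)
        neg_img = t2i.generate(negative_prompt)

        # 3. Keep the pair only if each image is faithful to its own caption
        #    and the negative image is NOT faithful to the original caption,
        #    so the pair differs by a small, targeted visual change.
        if (vqa.score(pos_img, prompt) >= threshold
                and vqa.score(neg_img, negative_prompt) >= threshold
                and vqa.score(neg_img, prompt) < threshold):
            pairs.append(ContrastivePair(prompt, negative_prompt, pos_img, neg_img))
    return pairs
```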
Related papers
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z) - IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation [70.8833857249951]
IterComp is a novel framework that aggregates composition-aware model preferences from multiple models.
We propose an iterative feedback learning method to enhance compositionality in a closed-loop manner.
IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation.
arXiv Detail & Related papers (2024-10-09T17:59:13Z) - Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z) - RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models [42.20230095700904]
RealCompo is a new training-free and transfer-friendly text-to-image generation framework.
An intuitive and novel balancer is proposed to balance the strengths of the two models in the denoising process.
Our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models.
arXiv Detail & Related papers (2024-02-20T10:56:52Z) - CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models [48.10798436003449]
Images produced by text-to-image diffusion models might not always faithfully represent the semantic intent of the provided text prompt.
Our work introduces a novel perspective by tackling this challenge in a contrastive context.
We conduct extensive experiments across a wide variety of scenarios, each involving unique combinations of objects, attributes, and scenes.
arXiv Detail & Related papers (2023-12-11T01:42:15Z) - Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis [78.28620571530706]
Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks.
We improve the compositional skills of T2I models, specifically more accurate attribute binding and better image compositions.
arXiv Detail & Related papers (2022-12-09T18:30:24Z) - Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z) - Explicit and implicit models in infrared and visible image fusion [5.842112272932475]
This paper discusses the limitations of deep learning models in image fusion and the corresponding optimization strategies.
Ten models were screened for comparison experiments on 21 test sets.
The qualitative and quantitative results show that the implicit models have a more comprehensive ability to learn image features.
arXiv Detail & Related papers (2022-06-20T06:05:09Z) - Compositional Visual Generation with Composable Diffusion Models [80.75258849913574]
We propose an alternative structured approach for compositional generation using diffusion models.
An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image.
The proposed method can generate scenes at test time that are substantially more complex than those seen in training.
arXiv Detail & Related papers (2022-06-03T17:47:04Z)
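The Composable Diffusion entry above generates an image by combining several diffusion models, each responsible for one component of the scene. One common way to realize this is to add per-concept deltas of the predicted noise to an unconditional prediction at every denoising step; the sketch below illustrates that idea, assuming a hypothetical `model(x_t, t, cond)` wrapper that returns an epsilon prediction. The signature and weights are illustrative, not the paper's exact implementation.

```python
import torch


def composed_eps_prediction(model, x_t: torch.Tensor, t: torch.Tensor,
                            concept_conds, null_cond, weights=None) -> torch.Tensor:
    """Combine several concept conditions into a single denoising step.

    Illustrative composition rule: start from the unconditional noise
    estimate and add a weighted delta for each concept, so every concept
    nudges the sample in its own direction.
    """
    if weights is None:
        weights = [1.0] * len(concept_conds)

    eps_uncond = model(x_t, t, null_cond)          # unconditional prediction
    eps = eps_uncond.clone()
    for w, cond in zip(weights, concept_conds):
        eps_cond = model(x_t, t, cond)             # concept-conditioned prediction
        eps = eps + w * (eps_cond - eps_uncond)    # add the concept's delta
    return eps
```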