IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
- URL: http://arxiv.org/abs/2410.07171v1
- Date: Wed, 9 Oct 2024 17:59:13 GMT
- Title: IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
- Authors: Xinchen Zhang, Ling Yang, Guohao Li, Yaqi Cai, Jiake Xie, Yong Tang, Yujiu Yang, Mengdi Wang, Bin Cui
- Abstract summary: IterComp is a novel framework that aggregates composition-aware model preferences from multiple models.
We propose an iterative feedback learning method to enhance compositionality in a closed-loop manner.
IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation.
- Score: 70.8833857249951
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advanced diffusion models like RPG, Stable Diffusion 3 and FLUX have made notable strides in compositional text-to-image generation. However, these methods typically exhibit distinct strengths for compositional generation, with some excelling in handling attribute binding and others in spatial relationships. This disparity highlights the need for an approach that can leverage the complementary strengths of various models to comprehensively improve the composition capability. To this end, we introduce IterComp, a novel framework that aggregates composition-aware model preferences from multiple models and employs an iterative feedback learning approach to enhance compositional generation. Specifically, we curate a gallery of six powerful open-source diffusion models and evaluate them on three key compositional metrics: attribute binding, spatial relationships, and non-spatial relationships. Based on these metrics, we develop a composition-aware model preference dataset comprising numerous image-rank pairs to train composition-aware reward models. Then, we propose an iterative feedback learning method to enhance compositionality in a closed-loop manner, enabling the progressive self-refinement of both the base diffusion model and reward models over multiple iterations. Theoretical analysis demonstrates the effectiveness of our approach, and extensive experiments show its significant superiority over previous SOTA methods (e.g., Omost and FLUX), particularly in multi-category object composition and complex semantic alignment. IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation. Code: https://github.com/YangLing0818/IterComp
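The pipeline described in the abstract has two learnable pieces: composition-aware reward models fitted on image-rank pairs, and a base diffusion model fine-tuned against their scores in a closed loop. The toy PyTorch sketch below shows one plausible way to wire these two stages together. It is not the released IterComp code (see the GitHub link above); the Bradley-Terry-style pairwise loss, the `RewardModel` and `reward_feedback_step` names, and the placeholder encoder/denoiser modules are all assumptions made for illustration.

```python
# Toy sketch (not the official IterComp implementation) of the two stages
# described in the abstract: (1) fit a composition-aware reward model on
# image-rank pairs, (2) feed its score back into the base model's training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Hypothetical composition-aware reward model: scores an (image, prompt)
    pair from precomputed embeddings (e.g. CLIP features)."""
    def __init__(self, dim=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, img_emb, txt_emb):
        return self.head(torch.cat([img_emb, txt_emb], dim=-1)).squeeze(-1)

def ranking_loss(reward_model, txt_emb, img_win, img_lose):
    """Bradley-Terry-style pairwise loss: the preferred image of each
    image-rank pair should receive a higher reward than the rejected one."""
    return -F.logsigmoid(reward_model(img_win, txt_emb)
                         - reward_model(img_lose, txt_emb)).mean()

def reward_feedback_step(denoise_fn, reward_model, encode_image, txt_emb, noisy, t, opt):
    """One feedback-learning update: denoise, score the result with the frozen
    reward model, and ascend the composition-aware reward."""
    pred = denoise_fn(noisy, t, txt_emb)            # predicted clean image
    loss = -reward_model(encode_image(pred), txt_emb).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

if __name__ == "__main__":
    dim, B = 512, 4
    txt_emb = torch.randn(B, dim)
    # Stage 1: reward-model training on (preferred, rejected) pairs (random here).
    rm = RewardModel(dim)
    rm_opt = torch.optim.Adam(rm.parameters(), lr=1e-4)
    loss = ranking_loss(rm, txt_emb, torch.randn(B, dim), torch.randn(B, dim))
    rm_opt.zero_grad(); loss.backward(); rm_opt.step()
    # Stage 2: feedback learning on a placeholder "denoiser"; per the abstract,
    # the two stages alternate so base model and reward models refine each other.
    backbone = nn.Linear(dim, dim)                  # stand-in for a diffusion UNet
    opt = torch.optim.Adam(backbone.parameters(), lr=1e-5)
    for p in rm.parameters():
        p.requires_grad_(False)
    reward_feedback_step(lambda x, t, c: backbone(x), rm, nn.Identity(),
                         txt_emb, torch.randn(B, dim), None, opt)
```

In the actual framework these stages are iterated, with newly generated images re-ranked to expand the preference dataset and progressively refine both the base diffusion model and the reward models.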
Related papers
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z) - Progressive Compositionality In Text-to-Image Generative Models [33.18510121342558]
We propose EvoGen, a new curriculum for contrastive learning of diffusion models.
In this work, we leverage large language models (LLMs) to compose realistic, complex scenarios.
We also harness Visual-Question Answering (VQA) systems alongside diffusion models to automatically curate a contrastive dataset, ConPair.
arXiv Detail & Related papers (2024-10-22T05:59:29Z) - ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models [65.82630283336051]
We show that the space spanned by the combination of dimensions and attributes is insufficiently sampled by the existing training schemes of diffusion generative models.
We present a simple fix to this problem by constructing stochastic processes that fully exploit these combinatorial structures, hence the name ComboStoc.
arXiv Detail & Related papers (2024-05-22T15:23:10Z) - Robust Few-Shot Ensemble Learning with Focal Diversity-Based Pruning [10.551984557040102]
This paper presents FusionShot, a focal diversity optimized few-shot ensemble learning approach.
It boosts the robustness and generalization performance of pre-trained few-shot models.
Experiments on representative few-shot benchmarks show that the top-K ensembles recommended by FusionShot can outperform the representative SOTA few-shot models.
arXiv Detail & Related papers (2024-04-05T22:21:49Z) - RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models [42.20230095700904]
RealCompo is a new training-free and transfer-friendly text-to-image generation framework.
An intuitive and novel balancer is proposed to balance the strengths of the two models in the denoising process.
Our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models.
arXiv Detail & Related papers (2024-02-20T10:56:52Z) - Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion [50.59261592343479]
We present Kandinsky, a novel exploration of the latent diffusion architecture.
A separate image prior model is trained to map text embeddings to the image embeddings of CLIP.
We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting.
arXiv Detail & Related papers (2023-10-05T12:29:41Z) - Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC [102.64648158034568]
Diffusion models have quickly become the prevailing approach to generative modeling in many domains.
We propose an energy-based parameterization of diffusion models which enables the use of new compositional operators.
We find these MCMC samplers lead to notable improvements in compositional generation across a wide set of problems.
arXiv Detail & Related papers (2023-02-22T18:48:46Z) - Compositional Visual Generation with Composable Diffusion Models [80.75258849913574]
We propose an alternative structured approach for compositional generation using diffusion models.
An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image.
The proposed method can generate scenes at test time that are substantially more complex than those seen in training (see the sketch below).
arXiv Detail & Related papers (2022-06-03T17:47:04Z)
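As a concrete reference for the composition idea in the Composable Diffusion Models entry above, below is a minimal sketch of the standard score-composition (conjunction) operator: the noise predictions obtained under each component's condition are combined around the unconditional prediction. The `composed_eps` name and the `denoiser(x, t, cond)` signature are placeholders for illustration, not the paper's API.

```python
# Minimal sketch of score composition for a conjunction of concepts:
# eps_hat = eps(x,t) + sum_i w_i * (eps(x,t|c_i) - eps(x,t)).
import torch

def composed_eps(denoiser, x, t, concepts, weights=None):
    """Combine per-concept noise predictions from one diffusion model.
    `denoiser(x, t, cond)` with cond=None is assumed to return the
    unconditional prediction; `concepts` is a list of condition embeddings."""
    eps_uncond = denoiser(x, t, None)
    weights = weights or [1.0] * len(concepts)
    eps = eps_uncond.clone()
    for w, c in zip(weights, concepts):
        eps = eps + w * (denoiser(x, t, c) - eps_uncond)
    return eps  # usable in place of a single model's output in a sampling step

if __name__ == "__main__":
    # Dummy denoiser: a conditioned prediction just shifts toward the concept.
    def dummy_denoiser(x, t, cond):
        return x if cond is None else x + 0.1 * cond
    x = torch.randn(2, 16)
    concepts = [torch.randn(2, 16), torch.randn(2, 16)]
    print(composed_eps(dummy_denoiser, x, t=0, concepts=concepts).shape)
```

The composed prediction can then be plugged into any standard ancestral or DDIM sampling loop in place of a single conditional prediction.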