Related papers: RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models

RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models

URL: http://arxiv.org/abs/2402.12908v3
Date: Mon, 14 Oct 2024 07:27:37 GMT
Title: RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models
Authors: Xinchen Zhang, Ling Yang, Yaqi Cai, Zhaochen Yu, Kai-Ni Wang, Jiake Xie, Ye Tian, Minkai Xu, Yong Tang, Yujiu Yang, Bin Cui,
Abstract summary: RealCompo is a new training-free and transferred-friendly text-to-image generation framework. An intuitive and novel balancer is proposed to balance the strengths of the two models in denoising process. Our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models.
Score: 42.20230095700904
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models have achieved remarkable advancements in text-to-image generation. However, existing models still have many difficulties when faced with multiple-object compositional generation. In this paper, we propose RealCompo, a new training-free and transferred-friendly text-to-image generation framework, which aims to leverage the respective advantages of text-to-image models and spatial-aware image diffusion models (e.g., layout, keypoints and segmentation maps) to enhance both realism and compositionality of the generated images. An intuitive and novel balancer is proposed to dynamically balance the strengths of the two models in denoising process, allowing plug-and-play use of any model without extra training. Extensive experiments show that our RealCompo consistently outperforms state-of-the-art text-to-image models and spatial-aware image diffusion models in multiple-object compositional generation while keeping satisfactory realism and compositionality of the generated images. Notably, our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models. Our code is available at: https://github.com/YangLing0818/RealCompo

Related papers

GMAIL: Generative Modality Alignment for generated Image Learning [51.071351994330605]
We propose a novel framework for discriminative use of generated images, coined GMAIL, that explicitly treats generated images as a separate modality from real images.<n>Our framework can be easily incorporated with various vision-language models, and we demonstrate its efficacy throughout extensive experiments.
arXiv Detail & Related papers (2026-02-17T05:40:25Z)
Boosting Generative Image Modeling via Joint Image-Feature Synthesis [10.32324138962724]
We introduce a novel generative image modeling framework that seamlessly bridges the gap by leveraging a diffusion model to jointly model low-level image latents. Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance.
arXiv Detail & Related papers (2025-04-22T17:41:42Z)
Progressive Compositionality In Text-to-Image Generative Models [33.18510121342558]
We propose EvoGen, a new curriculum for contrastive learning of diffusion models. In this work, we leverage large-language models (LLMs) to compose realistic, complex scenarios. We also harness Visual-Question Answering (VQA) systems alongside diffusion models to automatically curate a contrastive dataset, ConPair.
arXiv Detail & Related papers (2024-10-22T05:59:29Z)
Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which non-autoregressive masked image modeling (MIM) text-to-image elevates to a level comparable with state-of-the-art diffusion models like SDXL. We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution. Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z)
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation [70.8833857249951]
IterComp is a novel framework that aggregates composition-aware model preferences from multiple models. We propose an iterative feedback learning method to enhance compositionality in a closed-loop manner. IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation.
arXiv Detail & Related papers (2024-10-09T17:59:13Z)
MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models [34.611309081801345]
Large diffusion-based Text-to-Image (T2I) models have shown impressive generative powers for text-to-image generation. In this paper, we propose a novel strategy to scale a generative model across new tasks with minimal compute.
arXiv Detail & Related papers (2024-04-15T17:55:56Z)
Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation. We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z)
Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis. Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z)
Compositional Visual Generation with Composable Diffusion Models [80.75258849913574]
We propose an alternative structured approach for compositional generation using diffusion models. An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image. The proposed method can generate scenes at test time that are substantially more complex than those seen in training.
arXiv Detail & Related papers (2022-06-03T17:47:04Z)
DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder [73.1010640692609]
We propose a VQ-VAE architecture model with a diffusion decoder (DiVAE) to work as the reconstructing component in image synthesis. Our model achieves state-of-the-art results and generates more photorealistic images specifically.
arXiv Detail & Related papers (2022-06-01T10:39:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.