Diffusion Beats Autoregressive: An Evaluation of Compositional Generation in Text-to-Image Models
- URL: http://arxiv.org/abs/2410.22775v1
- Date: Wed, 30 Oct 2024 07:43:29 GMT
- Authors: Arash Marioriyad, Parham Rezaei, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
- Abstract summary: Text-to-image (T2I) generative models have shown remarkable proficiency in producing high-quality, realistic, and natural images.
A new open-source diffusion-based T2I model, FLUX, has been introduced, demonstrating strong performance in high-quality image generation.
We evaluate the compositional generation capabilities of these newly introduced models against established models using the T2I-CompBench benchmark.
- Score: 3.5999252362400993
- Abstract: Text-to-image (T2I) generative models, such as Stable Diffusion and DALL-E, have shown remarkable proficiency in producing high-quality, realistic, and natural images from textual descriptions. However, these models sometimes fail to accurately capture all the details specified in the input prompts, particularly concerning entities, attributes, and spatial relationships. This issue becomes more pronounced when the prompt contains novel or complex compositions, leading to what are known as compositional generation failure modes. Recently, a new open-source diffusion-based T2I model, FLUX, has been introduced, demonstrating strong performance in high-quality image generation. Additionally, autoregressive T2I models like LlamaGen have claimed competitive visual quality performance compared to diffusion-based models. In this study, we evaluate the compositional generation capabilities of these newly introduced models against established models using the T2I-CompBench benchmark. Our findings reveal that LlamaGen, as a vanilla autoregressive model, is not yet on par with state-of-the-art diffusion models for compositional generation tasks under the same criteria, such as model size and inference time. On the other hand, the open-source diffusion-based model FLUX exhibits compositional generation capabilities comparable to the state-of-the-art closed-source model DALL-E3.
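The abstract describes scoring models on T2I-CompBench, which groups compositional prompts by category (e.g. attribute binding, spatial relationships) and averages per-image alignment scores. The harness below is a minimal, self-contained sketch of that evaluation loop; the prompts, the `dummy_model`, and the `dummy_scorer` are stand-ins (the real benchmark uses thousands of prompts and learned scorers such as BLIP-VQA), so only the aggregation logic reflects the described setup.

```python
from statistics import mean

# Hypothetical compositional prompts grouped by category, in the spirit of
# T2I-CompBench (stand-ins, not the benchmark's actual prompt set).
PROMPTS = {
    "color_binding": ["a red book and a yellow vase"],
    "spatial": ["a cat on the left of a dog"],
}


def evaluate(model, scorer, prompts_by_category):
    """Generate one image per prompt and average alignment scores per category."""
    results = {}
    for category, prompts in prompts_by_category.items():
        scores = [scorer(prompt, model(prompt)) for prompt in prompts]
        results[category] = mean(scores)
    return results


# Stand-in model and scorer so the sketch runs end to end; a real run would
# call a T2I model (e.g. FLUX or LlamaGen) and a learned alignment scorer.
def dummy_model(prompt):
    return f"<image for: {prompt}>"


def dummy_scorer(prompt, image):
    return 1.0 if prompt in image else 0.0


if __name__ == "__main__":
    print(evaluate(dummy_model, dummy_scorer, PROMPTS))
```

Keeping the model and scorer as injected callables is what lets one harness compare diffusion and autoregressive models under the same criteria, as the paper does.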
Related papers
- IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models [52.73820275861131]
Text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation.
Recent models such as FLUX.1 and Ideogram2.0 have demonstrated exceptional performance across various complex tasks.
This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability.
arXiv Detail & Related papers (2025-01-23T18:58:33Z)
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT-4V to bridge the gap between the reference image and the text input for the T2I model.
We also present the ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation [70.8833857249951]
IterComp is a novel framework that aggregates composition-aware model preferences from multiple models.
We propose an iterative feedback learning method to enhance compositionality in a closed-loop manner.
IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation.
arXiv Detail & Related papers (2024-10-09T17:59:13Z)
- GenMix: Combining Generative and Mixture Data Augmentation for Medical Image Classification [0.6554326244334868]
We propose a novel data augmentation technique called GenMix.
It combines generative and mixture approaches to leverage the strengths of both methods.
Our results demonstrate that GenMix enhances the performance of various generative models.
arXiv Detail & Related papers (2024-05-31T07:32:31Z)
- MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models [34.611309081801345]
Large diffusion-based Text-to-Image (T2I) models have shown impressive generative capabilities.
In this paper, we propose a novel strategy to scale a generative model across new tasks with minimal compute.
arXiv Detail & Related papers (2024-04-15T17:55:56Z)
- RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models [42.20230095700904]
RealCompo is a new training-free and transfer-friendly text-to-image generation framework.
An intuitive and novel balancer is proposed to balance the strengths of the two models in the denoising process.
Our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models.
arXiv Detail & Related papers (2024-02-20T10:56:52Z)
- Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation.
We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z)
- Conditional Generation from Unconditional Diffusion Models using Denoiser Representations [94.04631421741986]
We propose adapting pre-trained unconditional diffusion models to new conditions using the learned internal representations of the denoiser network.
We show that augmenting the Tiny ImageNet training set with synthetic images generated by our approach improves the classification accuracy of ResNet baselines by up to 8%.
arXiv Detail & Related papers (2023-06-02T20:09:57Z)
- Implementing and Experimenting with Diffusion Models for Text-to-Image Generation [0.0]
Two models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images can be generated from a simple textual description.
Text-to-image models require exceptionally large amounts of computational resources to train, as well as huge datasets collected from the internet.
This thesis contributes by reviewing the different approaches and techniques used by these models, and then by proposing our own implementation of a text-to-image model.
arXiv Detail & Related papers (2022-09-22T12:03:33Z)
- Compositional Visual Generation with Composable Diffusion Models [80.75258849913574]
We propose an alternative structured approach for compositional generation using diffusion models.
An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image.
The proposed method can generate scenes at test time that are substantially more complex than those seen in training.
arXiv Detail & Related papers (2022-06-03T17:47:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.