IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models
- URL: http://arxiv.org/abs/2501.13920v1
- Date: Thu, 23 Jan 2025 18:58:33 GMT
- Title: IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models
- Authors: Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li
- Abstract summary: Text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation.
Recent models such as FLUX.1 and Ideogram2.0 have demonstrated exceptional performance across various complex tasks.
This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability.
- Score: 52.73820275861131
- Abstract: With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising questions about whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models' performance across expanding domains. To thoroughly evaluate these models, we developed IMAGINE-E and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism and physical consistency, specific domain generation, challenging scenario generation, and multi-style creation tasks. This comprehensive assessment highlights each model's strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at https://github.com/jylei16/Imagine-e.
Related papers
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present the ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- Diffusion Beats Autoregressive: An Evaluation of Compositional Generation in Text-to-Image Models [3.5999252362400993]
Text-to-image (T2I) generative models have shown remarkable proficiency in producing high-quality, realistic, and natural images.
A new open-source diffusion-based T2I model, FLUX, has been introduced, demonstrating strong performance in high-quality image generation.
We evaluate the compositional generation capabilities of these newly introduced models against established models using the T2I-CompBench benchmark.
arXiv Detail & Related papers (2024-10-30T07:43:29Z)
- Implicit-Zoo: A Large-Scale Dataset of Neural Implicit Functions for 2D Images and 3D Scenes [65.22070581594426]
"Implicit-Zoo" is a large-scale dataset, requiring thousands of GPU training days to build, that facilitates research and development in this field.
We showcase two immediate benefits: the dataset makes it possible to (1) learn token locations for transformer models and (2) directly regress the 3D camera poses of 2D images with respect to NeRF models.
This in turn leads to improved performance in all three tasks of image classification, semantic segmentation, and 3D pose regression, thereby unlocking new avenues for research.
arXiv Detail & Related papers (2024-06-25T10:20:44Z)
- PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models [50.33699462106502]
Text-to-image (T2I) models frequently fail to produce images consistent with physical commonsense.
Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal knowledge.
We introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties.
arXiv Detail & Related papers (2024-06-17T17:49:01Z)
- Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation [3.976813869450304]
We focus on enhancing color and contrast, improving generation across multiple aspect ratios, and improving human-centric fine details.
Our model is open-source, and we hope the development of Playground v2.5 provides valuable guidelines for researchers aiming to elevate the aesthetic quality of diffusion-based image generation models.
arXiv Detail & Related papers (2024-02-27T06:31:52Z)
- Breathing New Life into 3D Assets with Generative Repainting [74.80184575267106]
Diffusion-based text-to-image models ignited immense attention from the vision community, artists, and content creators.
Recent works have proposed various pipelines powered by the entanglement of diffusion models and neural fields.
We explore the power of pretrained 2D diffusion models and standard 3D neural radiance fields as independent, standalone tools.
Our pipeline accepts any legacy renderable geometry, such as textured or untextured meshes, and orchestrates the interaction between 2D generative refinement and 3D consistency enforcement tools.
arXiv Detail & Related papers (2023-09-15T16:34:51Z)
- Visual Programming for Text-to-Image Generation and Evaluation [73.12069620086311]
We propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation.
First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation.
Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming.
arXiv Detail & Related papers (2023-05-24T16:42:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.