Related papers: Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation

URL: http://arxiv.org/abs/2505.24787v1
Date: Fri, 30 May 2025 16:48:14 GMT
Title: Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation
Authors: Yucheng Zhou, Jiahao Yuan, Qianning Wang,
Abstract summary: LongBench-T2I is a benchmark for evaluating text-to-image (T2I) models under complex instructions. LongBench-T2I consists of 500 intricately designed prompts spanning nine diverse visual evaluation dimensions. Plan2Gen is a framework that facilitates complex instruction-driven image generation without requiring additional model training.
Score: 9.978181430065987
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in text-to-image (T2I) generation have enabled models to produce high-quality images from textual descriptions. However, these models often struggle with complex instructions involving multiple objects, attributes, and spatial relationships. Existing benchmarks for evaluating T2I models primarily focus on general text-image alignment and fail to capture the nuanced requirements of complex, multi-faceted prompts. Given this gap, we introduce LongBench-T2I, a comprehensive benchmark specifically designed to evaluate T2I models under complex instructions. LongBench-T2I consists of 500 intricately designed prompts spanning nine diverse visual evaluation dimensions, enabling a thorough assessment of a model's ability to follow complex instructions. Beyond benchmarking, we propose an agent framework (Plan2Gen) that facilitates complex instruction-driven image generation without requiring additional model training. This framework integrates seamlessly with existing T2I models, using large language models to interpret and decompose complex prompts, thereby guiding the generation process more effectively. As existing evaluation metrics, such as CLIPScore, fail to adequately capture the nuances of complex instructions, we introduce an evaluation toolkit that automates the quality assessment of generated images using a set of multi-dimensional metrics. The data and code are released at https://github.com/yczhou001/LongBench-T2I.

Related papers

TIT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency [81.17906057429329]
LPG-Bench is a comprehensive benchmark for evaluating long-prompt-based text-to-image generation. We generate 2,600 images from 13 state-of-the-art models and perform comprehensive human-ranked annotations. We introduce a novel zero-shot metric based on text-to-image-to-text consistency, termed TIT, for evaluating long-prompt-generated images.
arXiv Detail & Related papers (2025-10-03T13:25:16Z)
AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models [58.85362281293525]
We introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We build upon this by developing a training-free, knowledge distillation technique utilizing Large Language Models to address this limitation.
arXiv Detail & Related papers (2025-09-19T16:41:39Z)
DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models [9.800887055353096]
We propose DeCoT (Decomposition-CoT), a framework to enhance T2I models' understanding and execution of complex instructions. Extensive experiments on the LongBench-T2I dataset demonstrate that DeCoT consistently and substantially improves the performance of leading T2I models.
arXiv Detail & Related papers (2025-08-17T15:15:39Z)
Why Settle for One? Text-to-ImageSet Generation and Evaluation [47.63138480571058]
Text-to-ImageSet (T2IS) generation aims to generate sets of images that meet various consistency requirements based on user instructions. We propose AutoT2IS, a training-free framework that maximally leverages pretrained Transformers' in-context capabilities to harmonize visual elements. Our method also demonstrates the ability to enable numerous underexplored real-world applications, confirming its substantial practical value.
arXiv Detail & Related papers (2025-06-29T15:01:16Z)
OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation [23.05106664412349]
Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. OneIG-Bench is a benchmark framework for evaluation of T2I models across multiple dimensions.
arXiv Detail & Related papers (2025-06-09T17:50:21Z)
TIIF-Bench: How Does Your T2I Model Follow Your Instructions? [7.13169573900556]
We present TIIF-Bench (Text-to-Image Instruction Following Benchmark), aiming to systematically assess T2I models' ability to interpret and follow intricate textual instructions. TIIF-Bench comprises a set of 5000 prompts organized along multiple dimensions, which are categorized into three levels of difficulty and complexity. Two critical attributes, i.e., text rendering and style control, are introduced to evaluate the precision of text synthesis and the aesthetic coherence of T2I models.
arXiv Detail & Related papers (2025-06-02T18:44:07Z)
DetailMaster: Can Your Text-to-Image Model Handle Long Prompts? [30.739878622982847]
We present DetailMaster, the first comprehensive benchmark designed to evaluate text-to-image (T2I) models. The benchmark comprises long and detail-rich prompts averaging 284.89 tokens, with high quality validated by expert annotators. Evaluation on 7 general-purpose and 5 long-prompt-optimized T2I models reveals critical performance limitations.
arXiv Detail & Related papers (2025-05-22T17:11:27Z)
CompAlign: Improving Compositional Text-to-Image Generation with a Complex Benchmark and Fine-Grained Feedback [58.27353205269664]
State-of-the-art T2I models are capable of generating high-resolution images given textual prompts. However, they struggle with accurately depicting compositional scenes that specify multiple objects, attributes, and spatial relations. We present CompAlign, a challenging benchmark with an emphasis on assessing the depiction of 3D-spatial relationships.
arXiv Detail & Related papers (2025-05-16T12:23:58Z)
Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models. We use GPT4V to bridge the gap between the reference image and the text input for the T2I model. We also present the ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets. We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks. We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z)
T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation [55.16845189272573]
T2I-CompBench++ is an enhanced benchmark for compositional text-to-image generation. It comprises 8,000 compositional text prompts categorized into four primary groups: attribute binding, object relationships, generative numeracy, and complex compositions.
arXiv Detail & Related papers (2023-07-12T17:59:42Z)
Visual Programming for Text-to-Image Generation and Evaluation [73.12069620086311]
We propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming.
arXiv Detail & Related papers (2023-05-24T16:42:17Z)
LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model [55.20469538848806]
This paper introduces LeftRefill, an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis.
arXiv Detail & Related papers (2023-05-19T10:29:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.