DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?
- URL: http://arxiv.org/abs/2505.16915v1
- Date: Thu, 22 May 2025 17:11:27 GMT
- Title: DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?
- Authors: Qirui Jiao, Daoyuan Chen, Yilun Huang, Xika Lin, Ying Shen, Yaliang Li
- Abstract summary: We present DetailMaster, the first comprehensive benchmark designed to evaluate text-to-image (T2I) models. The benchmark comprises long and detail-rich prompts averaging 284.89 tokens, with their quality validated by expert annotators. Evaluation on 7 general-purpose and 5 long-prompt-optimized T2I models reveals critical performance limitations.
- Score: 30.739878622982847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While recent text-to-image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, their performance significantly degrades when confronted with the long, detail-intensive prompts required in professional applications. We present DetailMaster, the first comprehensive benchmark specifically designed to evaluate T2I models' systematic ability to handle extended textual inputs that contain complex compositional requirements. Our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Explicit Spatial/Interactive Relationships. The benchmark comprises long and detail-rich prompts averaging 284.89 tokens, with their quality validated by expert annotators. Evaluation on 7 general-purpose and 5 long-prompt-optimized T2I models reveals critical performance limitations: state-of-the-art models achieve merely ~50% accuracy in key dimensions like attribute binding and spatial reasoning, while all models show progressive performance degradation as prompt length increases. Our analysis highlights systemic failures in structural comprehension and detail-overload handling, motivating future research into architectures with enhanced compositional reasoning. We open-source the dataset, data curation code, and evaluation tools to advance detail-rich T2I generation and enable broad applications that would otherwise be infeasible due to the lack of a dedicated benchmark.
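As a hedged illustration only (not the authors' released evaluation tools; the field names, the whitespace token proxy, and the length bucketing below are assumptions), the following Python sketch shows how a DetailMaster-style entry could record per-dimension detail checks, and how dimension-level accuracy and accuracy-by-prompt-length might be aggregated to surface the degradation reported above.

```python
# Hypothetical sketch of a DetailMaster-style scoring layout; names are illustrative.
from dataclasses import dataclass, field
from collections import defaultdict

DIMENSIONS = (
    "character_attributes",
    "structured_character_locations",
    "scene_attributes",
    "spatial_interactive_relationships",
)

@dataclass
class PromptEntry:
    prompt: str
    # checks[dim] -> list of booleans, one per verified detail in that dimension
    checks: dict = field(default_factory=lambda: {d: [] for d in DIMENSIONS})

    @property
    def num_tokens(self) -> int:
        # Crude whitespace proxy; the benchmark's ~284.89-token average
        # would come from a real tokenizer.
        return len(self.prompt.split())

def dimension_accuracy(entries):
    """Fraction of satisfied checks per evaluation dimension."""
    totals, hits = defaultdict(int), defaultdict(int)
    for e in entries:
        for dim, results in e.checks.items():
            totals[dim] += len(results)
            hits[dim] += sum(results)
    return {d: hits[d] / totals[d] for d in DIMENSIONS if totals[d]}

def accuracy_by_length(entries, bucket=100):
    """Mean per-entry accuracy grouped by prompt-length bucket, to expose degradation."""
    buckets = defaultdict(list)
    for e in entries:
        all_checks = [r for results in e.checks.values() for r in results]
        if all_checks:
            buckets[e.num_tokens // bucket * bucket].append(sum(all_checks) / len(all_checks))
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}

if __name__ == "__main__":
    demo = PromptEntry(
        prompt="A chef in a red apron stands left of a marble counter ...",
        checks={
            "character_attributes": [True, False],
            "structured_character_locations": [True],
            "scene_attributes": [True, True],
            "spatial_interactive_relationships": [False],
        },
    )
    print(dimension_accuracy([demo]))
    print(accuracy_by_length([demo]))
```

Swapping the whitespace proxy for a real tokenizer and feeding one entry per benchmark prompt would reproduce the two summary views the abstract discusses: per-dimension accuracy and accuracy as a function of prompt length.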
Related papers
- Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models [6.140839748607505]
Detail++ is a training-free framework built around Progressive Detail Injection (PDI). We decompose a complex prompt into a sequence of simplified sub-prompts, guiding the generation process in stages. Experiments on T2I-CompBench and a newly constructed style composition benchmark demonstrate that Detail++ significantly outperforms existing methods.
arXiv Detail & Related papers (2025-07-23T18:20:46Z) - LVLM-Composer's Explicit Planning for Image Generation [0.0]
We introduce LVLM-Composer, a novel 10-billion-parameter-scale LVLM specifically engineered for enhanced compositional image synthesis. Our method incorporates a Hierarchical Semantic Planning Module for structured prompt decomposition and a Fine-Grained Feature Alignment Mechanism for precise visual guidance during generation. Experiments on the LongBench-T2I benchmark, utilizing automatic evaluation by Gemini-2.0-Flash and InternVL3-78B, demonstrate LVLM-Composer's superior performance across critical compositional dimensions.
arXiv Detail & Related papers (2025-07-05T20:21:03Z) - OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation [23.05106664412349]
Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. OneIG-Bench is a benchmark framework for evaluation of T2I models across multiple dimensions.
arXiv Detail & Related papers (2025-06-09T17:50:21Z) - TIIF-Bench: How Does Your T2I Model Follow Your Instructions? [7.13169573900556]
We present TIIF-Bench (Text-to-Image Instruction Following Benchmark), aiming to systematically assess T2I models' ability to interpret and follow intricate textual instructions. TIIF-Bench comprises a set of 5000 prompts organized along multiple dimensions, categorized into three levels of difficulty and complexity. Two critical attributes, i.e., text rendering and style control, are introduced to evaluate the precision of text synthesis and the aesthetic coherence of T2I models.
arXiv Detail & Related papers (2025-06-02T18:44:07Z) - Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation [9.978181430065987]
LongBench-T2I is a benchmark for evaluating text-to-image (T2I) models under complex instructions. LongBench-T2I consists of 500 intricately designed prompts spanning nine diverse visual evaluation dimensions. Plan2Gen is a framework that facilitates complex instruction-driven image generation without requiring additional model training.
arXiv Detail & Related papers (2025-05-30T16:48:14Z) - GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts [53.568057283934714]
We propose a Vision-Language Model (VLM)-based framework that generates content-aware text logo layouts. We introduce two model techniques that reduce the computational cost for processing multiple glyph images simultaneously. To support instruction tuning of our model, we construct two extensive text logo datasets that are five times larger than existing public datasets.
arXiv Detail & Related papers (2024-11-18T10:04:10Z) - Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models [25.301280441283147]
This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance. We develop a novel retrieval evaluation benchmark spanning six document-level attributes. Our findings indicate that although fine-tuning models on instruction-aware retrieval datasets enhances performance, most models still fall short of instruction compliance.
arXiv Detail & Related papers (2024-10-31T11:47:21Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models [50.33699462106502]
Text-to-image (T2I) models frequently fail to produce images consistent with physical commonsense.
Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal knowledge.
We introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties.
arXiv Detail & Related papers (2024-06-17T17:49:01Z) - VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z) - ImageInWords: Unlocking Hyper-Detailed Image Descriptions [36.373619800014275]
ImageInWords (IIW) is a human-in-the-loop framework for curating hyper-detailed image descriptions.
We show major gains compared to recent datasets in comprehensiveness, specificity, hallucinations, and more.
We also show that fine-tuning with IIW data improves these metrics by +31% against models trained with prior work, even with only 9k samples.
arXiv Detail & Related papers (2024-05-05T02:15:11Z) - SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z) - T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation [55.16845189272573]
T2I-CompBench++ is an enhanced benchmark for compositional text-to-image generation. It comprises 8,000 compositional text prompts categorized into four primary groups: attribute binding, object relationships, generative numeracy, and complex compositions.
arXiv Detail & Related papers (2023-07-12T17:59:42Z) - Benchmarking Spatial Relationships in Text-to-Image Generation [102.62422723894232]
We investigate the ability of text-to-image models to generate correct spatial relationships among objects.
We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in the text is generated in the image (a minimal illustrative spatial-relation check in this spirit is sketched after this list).
Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
arXiv Detail & Related papers (2022-12-20T06:03:51Z)
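As a final illustration, the sketch below is a hypothetical, heavily simplified spatial-relation check in the spirit of the VISOR entry above; it is not the paper's metric, and the box layout, centroid rule, and all names are assumptions.

```python
# Hypothetical centroid-based spatial-relation check; not the VISOR implementation.
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), y axis pointing down

def centroid(box: Box) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def relation_holds(box_a: Box, box_b: Box, relation: str) -> bool:
    """True if object A stands in `relation` to object B by centroid comparison."""
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    if relation == "left of":
        return ax < bx
    if relation == "right of":
        return ax > bx
    if relation == "above":
        return ay < by  # smaller y means higher up in image coordinates
    if relation == "below":
        return ay > by
    raise ValueError(f"unknown relation: {relation}")

def score_image(detections: Dict[str, Box], subj: str, rel: str, obj: str) -> float:
    """1.0 if both named objects are detected and the relation holds, else 0.0."""
    if subj not in detections or obj not in detections:
        return 0.0  # a missing object counts as a failure
    return float(relation_holds(detections[subj], detections[obj], rel))

if __name__ == "__main__":
    dets = {"dog": (10, 120, 80, 200), "sofa": (150, 100, 300, 220)}
    print(score_image(dets, "dog", "left of", "sofa"))  # 1.0
```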