UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
- URL: http://arxiv.org/abs/2510.18701v1
- Date: Tue, 21 Oct 2025 14:56:46 GMT
- Title: UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
- Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
- Abstract summary: We introduce UniGenBench++, a unified semantic assessment benchmark for text-to-image generation. It comprises 600 prompts organized hierarchically to ensure both coverage and efficiency. It provides both English and Chinese versions of each prompt in short and long forms.
- Score: 40.644151228285246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks for evaluating how accurately generated images reflect the semantics of their textual prompts. However, (1) existing benchmarks lack diversity in prompt scenarios and multilingual support, both essential for real-world applicability; and (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions and falling short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) they span diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; and (2) they comprehensively probe T2I models' semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt in short and long forms. Leveraging the general world knowledge and fine-grained image understanding capabilities of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, we develop an effective pipeline for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-source T2I models, we systematically reveal their strengths and weaknesses across various aspects.
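The testpoint-based protocol described in the abstract maps naturally onto a small data model. The sketch below is illustrative only: the class names, field layout, theme and sub-criterion labels, and the judge_with_mllm stub are assumptions for exposition, not the benchmark's released schema or evaluation API (in the paper, judging is performed by Gemini-2.5-Pro or the authors' trained offline evaluator).

```python
from dataclasses import dataclass, field


@dataclass
class Testpoint:
    """One fine-grained semantic check attached to a prompt."""
    sub_criterion: str   # e.g. "Attribute/Color" (hypothetical label)
    question: str        # yes/no question posed to the MLLM judge


@dataclass
class BenchPrompt:
    """A benchmark prompt: theme/subtheme placement plus EN/ZH, short/long variants."""
    theme: str
    subtheme: str
    text_en_short: str
    text_en_long: str = ""
    text_zh_short: str = ""
    text_zh_long: str = ""
    testpoints: list[Testpoint] = field(default_factory=list)


def judge_with_mllm(image_path: str, question: str) -> bool:
    """Stub for an MLLM judge (e.g. Gemini-2.5-Pro or an offline evaluator).
    Replace with a real API call answering a yes/no question about the image."""
    raise NotImplementedError


def score_image(image_path: str, prompt: BenchPrompt) -> dict[str, float]:
    """Per-sub-criterion pass rate for one generated image."""
    buckets: dict[str, list[bool]] = {}
    for tp in prompt.testpoints:
        verdict = judge_with_mllm(image_path, tp.question)
        buckets.setdefault(tp.sub_criterion, []).append(verdict)
    return {crit: sum(v) / len(v) for crit, v in buckets.items()}


# Hypothetical example prompt with two testpoints:
example = BenchPrompt(
    theme="Daily Life", subtheme="Street Scene",
    text_en_short="A red umbrella leaning against a wooden bench.",
    text_zh_short="一把红色雨伞靠在木制长椅上。",
    testpoints=[
        Testpoint("Attribute/Color", "Is the umbrella red?"),
        Testpoint("Relation/Spatial", "Is the umbrella leaning against the bench?"),
    ],
)
```

Aggregating per sub-criterion rather than per prompt is what makes the fine-grained 10-primary/27-sub breakdown possible: one prompt contributes to several criteria at once.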
Related papers
- UniREditBench: A Unified Reasoning-based Image Editing Benchmark [52.54256348710893]
This work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings.
arXiv Detail & Related papers (2025-11-03T07:24:57Z)
- UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases. UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG, built from 70k real-world PDF pages. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z)
- TIT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency [81.17906057429329]
LPG-Bench is a comprehensive benchmark for evaluating long-prompt-based text-to-image generation. We generate 2,600 images from 13 state-of-the-art models and perform comprehensive human-ranked annotations. We introduce a novel zero-shot metric based on text-to-image-to-text consistency, termed TIT, for evaluating long-prompt-generated images.
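A minimal sketch of the text-to-image-to-text idea, under loose assumptions: caption the generated image with any capable VLM, then compare that caption to the original long prompt in sentence-embedding space. The captioner stub and the choice of encoder are placeholders; the paper's actual TIT-Score components may differ.

```python
from sentence_transformers import SentenceTransformer, util


def caption_image(image_path: str) -> str:
    """Stub: describe the generated image with any capable VLM."""
    raise NotImplementedError


def tit_consistency(long_prompt: str, image_path: str,
                    embedder: SentenceTransformer) -> float:
    """Text -> image -> text: score how well the image's caption
    preserves the semantics of the original long prompt."""
    caption = caption_image(image_path)
    embs = embedder.encode([long_prompt, caption], convert_to_tensor=True)
    return util.cos_sim(embs[0], embs[1]).item()


# embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
```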
arXiv Detail & Related papers (2025-10-03T13:25:16Z)
- MEF: A Systematic Evaluation Framework for Text-to-Image Models [21.006921005280493]
Current evaluations rely on either ELO for overall ranking or MOS for dimension-specific scoring. We introduce the Magic Evaluation Framework (MEF), a systematic and practical approach for evaluating T2I models. We release our evaluation framework and make Magic-Bench-377 fully open-source to advance research in the evaluation of visual generative models.
arXiv Detail & Related papers (2025-09-22T15:32:42Z)
- The Telephone Game: Evaluating Semantic Drift in Unified Models [41.650904633974584]
Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. Existing evaluation benchmarks consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME and MMBench for I2T. These isolated single-pass metrics do not reveal cross-consistency: whether a model that "understands" a concept can also "render" it, nor whether semantic meaning is preserved across repeated generation-understanding cycles.
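The "telephone game" framing suggests a simple protocol: alternate the unified model's T2I and I2T directions and track how far each round-trip description drifts from the seed prompt. The sketch below is a hedged illustration; the callables and the similarity function are placeholders, not the paper's API.

```python
from typing import Any, Callable


def telephone_drift(
    seed_prompt: str,
    t2i: Callable[[str], Any],        # unified model, generation direction
    i2t: Callable[[Any], str],        # same model, understanding direction
    similarity: Callable[[str, str], float],
    hops: int = 3,
) -> list[float]:
    """Run repeated generate-then-describe cycles and score each round-trip
    description against the original prompt. A flat curve means low drift."""
    scores: list[float] = []
    text = seed_prompt
    for _ in range(hops):
        text = i2t(t2i(text))         # one full T2I -> I2T cycle
        scores.append(similarity(seed_prompt, text))
    return scores
```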
arXiv Detail & Related papers (2025-09-04T17:53:52Z)
- Why Settle for One? Text-to-ImageSet Generation and Evaluation [72.55708276046124]
Text-to-ImageSet (T2IS) generation aims to generate sets of images that meet various consistency requirements based on user instructions. We propose AutoT2IS, a training-free framework that maximally leverages pretrained Transformers' in-context capabilities to harmonize visual elements. Our method also enables numerous underexplored real-world applications, confirming its substantial practical value.
arXiv Detail & Related papers (2025-06-29T15:01:16Z)
- OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation [23.05106664412349]
Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. OneIG-Bench is a benchmark framework for evaluating T2I models across multiple dimensions.
arXiv Detail & Related papers (2025-06-09T17:50:21Z)
- TIIF-Bench: How Does Your T2I Model Follow Your Instructions? [7.13169573900556]
We present TIIF-Bench (Text-to-Image Instruction Following Benchmark), aiming to systematically assess T2I models' ability to interpret and follow intricate textual instructions. TIIF-Bench comprises a set of 5,000 prompts organized along multiple dimensions and categorized into three levels of difficulty and complexity. Two critical attributes, i.e., text rendering and style control, are introduced to evaluate the precision of text synthesis and the aesthetic coherence of T2I models.
arXiv Detail & Related papers (2025-06-02T18:44:07Z)
- Multi-Modal Language Models as Text-to-Image Model Evaluators [16.675735328424786]
Multimodal Text-to-Image Eval (MT2IE) is an evaluation framework that iteratively generates prompts for evaluation. We show that MT2IE's prompt-generation consistency scores have higher correlation with human judgment than scores previously introduced in the literature.
arXiv Detail & Related papers (2025-05-01T17:47:55Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)