MEF: A Systematic Evaluation Framework for Text-to-Image Models
- URL: http://arxiv.org/abs/2509.17907v1
- Date: Mon, 22 Sep 2025 15:32:42 GMT
- Title: MEF: A Systematic Evaluation Framework for Text-to-Image Models
- Authors: Xiaojing Dong, Weilin Huang, Liang Li, Yiying Li, Shu Liu, Tongtong Ou, Shuang Ouyang, Yu Tian, Fengxuan Zhao
- Abstract summary: Current evaluations rely on either ELO for overall ranking or MOS for dimension-specific scoring.
We introduce the Magic Evaluation Framework (MEF), a systematic and practical approach for evaluating T2I models.
We release our evaluation framework and make Magic-Bench-377 fully open-source to advance research in the evaluation of visual generative models.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Rapid advances in text-to-image (T2I) generation have placed greater demands on evaluation methodologies. Existing benchmarks center on objective capabilities and dimensions but lack an application-scenario perspective, limiting external validity. Moreover, current evaluations typically rely on either ELO for overall ranking or MOS for dimension-specific scoring, yet both methods have inherent shortcomings and limited interpretability. We therefore introduce the Magic Evaluation Framework (MEF), a systematic and practical approach to evaluating T2I models. First, we propose a structured taxonomy encompassing user scenarios, elements, element compositions, and text expression forms to construct Magic-Bench-377, which supports label-level assessment and ensures balanced coverage of both user scenarios and capabilities. On this basis, we combine ELO and dimension-specific MOS to generate model rankings and fine-grained assessments, respectively. This joint evaluation method further enables us to quantitatively analyze each dimension's contribution to user satisfaction using multivariate logistic regression. By applying MEF to current T2I models, we obtain a leaderboard and identify key characteristics of the leading models. We release our evaluation framework and make Magic-Bench-377 fully open-source to advance research in the evaluation of visual generative models.
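To make the joint evaluation method concrete, here is a minimal sketch of the two ingredients the abstract describes: pairwise ELO updates for overall ranking, and a multivariate logistic regression relating dimension-specific MOS scores to user satisfaction. The K-factor, dimension names, taxonomy labels, data layout, and all function names are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of MEF's joint evaluation idea (not the authors' code).
import numpy as np
from sklearn.linear_model import LogisticRegression

# A hypothetical Magic-Bench-377 record, illustrating the taxonomy axes the
# abstract names (user scenario, elements, element composition, text form).
example_prompt = {
    "scenario": "poster design",           # assumed scenario label
    "elements": ["person", "typography"],  # assumed element labels
    "composition": "multi-object",         # assumed composition label
    "text_form": "long descriptive",       # assumed expression form
    "prompt": "A vintage concert poster featuring a violinist ...",
}

K = 32  # conventional ELO step size (assumed; not specified in the abstract)

def elo_update(r_a, r_b, a_wins):
    """One pairwise comparison between models A and B on the same prompt."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    return r_a + K * (score_a - expected_a), r_b - K * (score_a - expected_a)

# Hypothetical per-image MOS scores (1-5 scale) on three example dimensions,
# with a synthetic binary "user satisfied" label for each rated image.
rng = np.random.default_rng(0)
mos = rng.uniform(1, 5, size=(200, 3))
satisfied = (mos @ np.array([0.9, 0.5, 0.2]) + rng.normal(0, 1, 200)) > 5.5

# Multivariate logistic regression: each coefficient estimates how strongly
# one MOS dimension contributes to the odds of user satisfaction.
reg = LogisticRegression().fit(mos, satisfied)
for dim, coef in zip(["alignment", "aesthetics", "plausibility"], reg.coef_[0]):
    print(f"{dim}: {coef:+.3f}")
```

On this synthetic data, the fitted coefficients recover the relative weights used to generate the labels, which is the kind of dimension-level attribution the paper reports for real user-satisfaction judgments.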
Related papers
- UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation [40.644151228285246]
We introduce UniGenBench++, a unified semantic assessment benchmark for text-to-image generation.
It comprises 600 prompts organized hierarchically to ensure both coverage and efficiency.
It provides both English and Chinese versions of each prompt in short and long forms.
arXiv Detail & Related papers (2025-10-21T14:56:46Z)
- Adapting Vision-Language Models for Evaluating World Models [24.813041196394582]
We present UNIVERSE, a method for adapting Vision-language Evaluator for Rollouts in Simulated Environments under data and compute constraints.
We conduct a large-scale study comparing full, partial, and parameter-efficient finetuning across task formats, context lengths, sampling strategies, and data compositions.
The resulting unified evaluator matches the performance of task-specific baselines using a single checkpoint.
arXiv Detail & Related papers (2025-06-22T09:53:28Z)
- OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation [23.05106664412349]
Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts.
OneIG-Bench is a benchmark framework for evaluating T2I models across multiple dimensions.
arXiv Detail & Related papers (2025-06-09T17:50:21Z)
- Minos: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text [51.149562188883486]
We introduce Minos-Corpus, a large-scale multimodal evaluation dataset that combines evaluation data from both humans and GPT.
Based on this corpus, we propose Data Selection and Balance and Mix-SFT training methods, and apply DPO to develop Minos.
arXiv Detail & Related papers (2025-06-03T06:17:16Z)
- From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback [36.68929551237421]
We introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results.
Our project homepage and dataset are available at https://liudan193.io/Feedbacker.
arXiv Detail & Related papers (2025-05-10T16:52:40Z)
- VideoGen-Eval: Agent-based System for Video Generation Evaluation [54.662739174367836]
Video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models.
We propose VideoGen-Eval, an agent evaluation system that integrates content structuring, MLLM-based content judgment, and patch tools for temporal-dense dimensions.
We introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system.
arXiv Detail & Related papers (2025-03-30T14:12:21Z)
- Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey [64.08485471150486]
This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings.
We systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication.
arXiv Detail & Related papers (2025-03-28T14:08:40Z)
- EvalGIM: A Library for Evaluating Generative Image Models [26.631349186382664]
We introduce EvalGIM, a library for evaluating text-to-image generative models.
EvalGIM contains broad support for datasets and metrics used to measure quality, diversity, and consistency.
EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models.
arXiv Detail & Related papers (2024-12-13T23:15:35Z)
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development.
Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
arXiv Detail & Related papers (2023-08-08T16:41:16Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.