WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
- URL: http://arxiv.org/abs/2503.07265v1
- Date: Mon, 10 Mar 2025 12:47:53 GMT
- Title: WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
- Authors: Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, Li Yuan,
- Abstract summary: Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. We propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation.
- Score: 26.61175134316007
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose $\textbf{WISE}$, the first benchmark specifically designed for $\textbf{W}$orld Knowledge-$\textbf{I}$nformed $\textbf{S}$emantic $\textbf{E}$valuation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 sub-domains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce $\textbf{WiScore}$, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 sub-domains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
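The abstract does not spell out how WiScore is computed, so the sketch below shows only a plausible shape for such a metric: an MLLM judge rates each generated image on a few dimensions (consistency with the intended knowledge, realism, aesthetic quality) and a fixed weighted average is reported. The dimension names, weights, and 0-2 score range are assumptions for illustration, not necessarily the paper's exact rubric.

```python
# Hypothetical sketch of a WiScore-style metric: an MLLM judge rates each
# image on a few dimensions, and a fixed weighted average is reported.
# Dimension names, weights, and the 0-2 score range are illustrative
# assumptions, not necessarily the WISE paper's exact rubric.
from statistics import mean

WEIGHTS = {"consistency": 0.7, "realism": 0.2, "aesthetic": 0.1}
MAX_SCORE = 2  # assume each dimension is judged on a 0-2 scale

def wiscore_like(judgments: list[dict[str, int]]) -> float:
    """judgments: one dict of per-dimension MLLM ratings per image."""
    per_image = [
        sum(WEIGHTS[d] * j[d] for d in WEIGHTS) / MAX_SCORE
        for j in judgments
    ]
    return mean(per_image)  # averaged over the benchmark's prompts

# Example: two images, each rated by an MLLM judge on every dimension.
print(wiscore_like([
    {"consistency": 2, "realism": 2, "aesthetic": 1},
    {"consistency": 1, "realism": 2, "aesthetic": 2},
]))  # -> 0.8
```

In practice, the per-dimension ratings would come from prompting an MLLM with the original prompt, the intended piece of world knowledge, and the generated image.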
Related papers
- Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models [15.983959465314749]
We introduce PicWorld, the first comprehensive benchmark that assesses T2I models' grasp of implicit world knowledge and physical causal reasoning. The benchmark consists of 1,100 prompts across three core categories. A thorough analysis of 17 mainstream T2I models on PicWorld shows that all of them exhibit, to varying degrees, a fundamental limitation in implicit world knowledge and physical causal reasoning.
arXiv Detail & Related papers (2025-11-23T03:44:54Z)
- AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models [58.85362281293525]
We introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We build on this by developing a training-free knowledge-distillation technique that uses Large Language Models to address this limitation; a minimal sketch of this kind of prompt enrichment follows.
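As a hedged illustration of training-free, LLM-based prompt enrichment, the sketch below asks an LLM to make the pose and physics implied by an action prompt explicit before it is sent to a T2I model. The model name, client API, and instruction wording are assumptions, not the AcT2I authors' exact procedure.

```python
# Hypothetical sketch: LLM-based prompt enrichment in the spirit of
# training-free knowledge distillation for action-centric T2I prompts.
# The model name and instruction wording are illustrative assumptions,
# not the AcT2I authors' exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def enrich_action_prompt(prompt: str) -> str:
    """Ask an LLM to make implicit action knowledge explicit for a T2I model."""
    instruction = (
        "Rewrite this text-to-image prompt so that the body pose, motion, "
        "and physical context implied by the action are stated explicitly. "
        "Keep it under 60 words and do not add new subjects.\n\n"
        f"Prompt: {prompt}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": instruction}],
    )
    return resp.choices[0].message.content.strip()

# Example: "a gymnast mid-vault" -> explicit limb positions, apparatus,
# and motion cues that a T2I model can render directly.
```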
arXiv Detail & Related papers (2025-09-19T16:41:39Z)
- GenExam: A Multidisciplinary Text-to-Image Exam [91.06661449186537]
GenExam is the first benchmark for multidisciplinary text-to-image exams. It features 1,000 samples across 10 subjects, with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points.
arXiv Detail & Related papers (2025-09-17T17:59:14Z)
- Text-Visual Semantic Constrained AI-Generated Image Quality Assessment [47.575342788480505]
We propose a unified framework to enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images. Our approach integrates key capabilities from multiple models and tackles the aforementioned challenges by introducing two core modules. Tests conducted on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2025-07-14T16:21:05Z)
- Why Settle for One? Text-to-ImageSet Generation and Evaluation [47.63138480571058]
Text-to-ImageSet (T2IS) generation aims to generate sets of images that meet various consistency requirements based on user instructions. We propose AutoT2IS, a training-free framework that maximally leverages pretrained Transformers' in-context capabilities to harmonize visual elements. Our method also enables numerous underexplored real-world applications, confirming its substantial practical value.
arXiv Detail & Related papers (2025-06-29T15:01:16Z)
- OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation [23.05106664412349]
Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. OneIG-Bench is a benchmark framework for evaluating T2I models across multiple dimensions.
arXiv Detail & Related papers (2025-06-09T17:50:21Z)
- Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation [10.583920883457635]
We introduce Align Beyond Prompts (ABP), a benchmark that measures the alignment of generated images with real-world knowledge beyond the prompts themselves. ABP comprises over 2,000 meticulously crafted prompts covering real-world knowledge across six distinct scenarios. ABPScore is a metric that utilizes existing Multimodal Large Language Models (MLLMs) to assess the alignment between generated images and world knowledge beyond prompts.
arXiv Detail & Related papers (2025-05-24T14:56:09Z)
- WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation [38.196609962452655]
WorldGenBench is a benchmark designed to evaluate T2I models' world-knowledge grounding and implicit inferential capabilities. We propose the Knowledge Checklist Score, a structured metric that measures how well generated images satisfy key semantic expectations; a minimal sketch of a checklist-style score follows below. Our findings highlight the need for deeper understanding and inference capabilities in next-generation T2I systems.
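As a minimal sketch of a checklist-style metric like the Knowledge Checklist Score, the snippet below averages, over images, the fraction of checklist items a judge marked as satisfied. The data layout and function names are assumptions, not WorldGenBench's exact protocol.

```python
# Hypothetical sketch of a checklist-style score: for each prompt, a judge
# (e.g., an MLLM) marks each checklist item as satisfied or not, and the
# score is the mean fraction of satisfied items. Data layout and names are
# assumptions, not WorldGenBench's exact protocol.
from statistics import mean

def checklist_score(results: list[list[bool]]) -> float:
    """results[i][j] = whether image i satisfies checklist item j."""
    per_image = [sum(items) / len(items) for items in results if items]
    return mean(per_image) if per_image else 0.0

# Example: two images judged against 4-item and 2-item checklists.
print(checklist_score([[True, True, False, True], [True, False]]))  # 0.625
```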
arXiv Detail & Related papers (2025-05-02T17:59:06Z)
- Towards Visual Text Grounding of Multimodal Large Language Model [88.0588924255417]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding.
Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark.
A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z)
- TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models.
Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features; a sketch of the underlying contrastive objective follows below.
Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
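For context on the contrastive component mentioned above, the sketch below shows the standard CLIP-style symmetric contrastive loss that TULIP-like models build on; TULIP's additional image-image and text-text terms follow the same pattern. This is an illustrative PyTorch sketch, not the authors' implementation.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss, the standard
# objective TULIP-like models build on. Illustrative only, not the
# authors' exact losses.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim) embeddings of matched image-text pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs sit on the diagonal; cross-entropy in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```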
arXiv Detail & Related papers (2025-03-19T17:58:57Z)
- T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts [21.897804514122843]
We present T2I-FactualBench, the largest benchmark to date in terms of the number of concepts and prompts designed to evaluate the factuality of knowledge-intensive concept generation. T2I-FactualBench consists of a three-tiered knowledge-intensive text-to-image generation framework, ranging from basic memorization of individual knowledge concepts to the more complex composition of multiple knowledge concepts.
arXiv Detail & Related papers (2024-12-05T16:21:01Z)
- KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities [93.74881034001312]
We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
arXiv Detail & Related papers (2024-10-15T17:50:37Z)
- PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models [50.33699462106502]
Text-to-image (T2I) models frequently fail to produce images consistent with physical commonsense.
Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal knowledge.
We introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties.
arXiv Detail & Related papers (2024-06-17T17:49:01Z)
- Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation [5.55027585813848]
The capability to generate visual text is crucial, offering both academic interest and a wide range of practical applications.
We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text.
We demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit-distance scores; a sketch of an edit-distance-based text-rendering score follows.
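As a hedged sketch of the edit-distance component of such an evaluation, the snippet below scores OCR output from a generated image against the target string using a normalized Levenshtein distance. The OCR step itself is assumed (any off-the-shelf OCR engine), and this is not necessarily LenCom-Eval's exact formulation.

```python
# Hypothetical sketch of an edit-distance metric for rendered text: compare
# OCR output from a generated image against the target string. Only the
# scoring is shown; the OCR step is assumed.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def text_render_score(ocr_text: str, target: str) -> float:
    """1.0 = exact match; 0.0 = completely different."""
    if not target:
        return 1.0 if not ocr_text else 0.0
    return 1.0 - levenshtein(ocr_text, target) / max(len(ocr_text), len(target))

print(text_render_score("HELL0 WORLD", "HELLO WORLD"))  # ~0.91
```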
arXiv Detail & Related papers (2024-03-25T04:54:49Z)
- Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack [75.00066365801993]
Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text.
These pre-trained models often face challenges when it comes to generating highly aesthetic images.
We propose quality-tuning to guide a pre-trained model to exclusively generate highly visually appealing images.
arXiv Detail & Related papers (2023-09-27T17:30:19Z)
- GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 CIDEr).
arXiv Detail & Related papers (2022-05-27T17:03:38Z)