Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation
- URL: http://arxiv.org/abs/2505.18730v1
- Date: Sat, 24 May 2025 14:56:09 GMT
- Title: Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation
- Authors: Wenchao Zhang, Jiahe Tian, Runze He, Jizhong Han, Jiao Dai, Miaomiao Feng, Wei Mi, Xiaodan Zhang
- Abstract summary: We introduce Align Beyond Prompts (ABP), a benchmark to measure alignment of generated images with real-world knowledge beyond prompts. ABP comprises over 2,000 meticulously crafted prompts, covering real-world knowledge across six distinct scenarios. ABPScore is a metric that utilizes existing Multimodal Large Language Models (MLLMs) to assess the alignment between generated images and world knowledge beyond prompts.
- Score: 10.583920883457635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent text-to-image (T2I) generation models have advanced significantly, enabling the creation of high-fidelity images from textual prompts. However, existing evaluation benchmarks primarily focus on the explicit alignment between generated images and prompts, neglecting the alignment with real-world knowledge beyond prompts. To address this gap, we introduce Align Beyond Prompts (ABP), a comprehensive benchmark designed to measure the alignment of generated images with real-world knowledge that extends beyond the explicit user prompts. ABP comprises over 2,000 meticulously crafted prompts, covering real-world knowledge across six distinct scenarios. We further introduce ABPScore, a metric that utilizes existing Multimodal Large Language Models (MLLMs) to assess the alignment between generated images and world knowledge beyond prompts, which demonstrates strong correlations with human judgments. Through a comprehensive evaluation of 8 popular T2I models using ABP, we find that even state-of-the-art models, such as GPT-4o, face limitations in integrating simple real-world knowledge into generated images. To mitigate this issue, we introduce a training-free strategy within ABP, named Inference-Time Knowledge Injection (ITKI). By applying this strategy to optimize 200 challenging samples, we achieved an improvement of approximately 43% in ABPScore. The dataset and code are available at https://github.com/smile365317/ABP.
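As a hedged illustration of the kind of aggregation an MLLM-judged metric such as ABPScore implies: an MLLM is asked whether each world-knowledge criterion holds in the generated image, and the binary verdicts are combined into a score. The exact prompting and scoring protocol is defined in the paper, not here; the function name and the criterion keys below are hypothetical.

```python
def abp_style_score(judgments):
    """Aggregate binary MLLM verdicts into a 0-1 alignment score.

    `judgments` maps each world-knowledge criterion (hypothetical
    names) to the MLLM's yes/no verdict for the generated image.
    This averaging is a simplification of the paper's metric.
    """
    if not judgments:
        raise ValueError("need at least one criterion")
    return sum(judgments.values()) / len(judgments)

# Example: an image of ice cream in direct sunlight should show
# melting; two of three knowledge checks pass here.
score = abp_style_score({
    "ice_cream_melts_in_sun": True,
    "shadow_matches_light_direction": False,
    "reflection_present_on_water": True,
})
```

A real pipeline would obtain each verdict by querying an MLLM with the generated image and a criterion-specific question; only the aggregation step is shown.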
Related papers
- Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models [15.983959465314749]
We introduce PicWorld, the first comprehensive benchmark that assesses the grasp of implicit world knowledge and physical causal reasoning of T2I models. This benchmark consists of 1,100 prompts across three core categories. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, illustrating that they universally exhibit a fundamental limitation in their capacity for implicit world knowledge and physical causal reasoning to varying degrees.
arXiv Detail & Related papers (2025-11-23T03:44:54Z) - TIT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency [81.17906057429329]
LPG-Bench is a comprehensive benchmark for evaluating long-prompt-based text-to-image generation. We generate 2,600 images from 13 state-of-the-art models and perform comprehensive human-ranked annotations. We introduce a novel zero-shot metric based on text-to-image-to-text consistency, termed TIT, for evaluating long-prompt-generated images.
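The text-to-image-to-text consistency idea behind TIT can be sketched as follows: caption the generated image (in the real pipeline, with an MLLM), then measure how similar that caption is to the original prompt. The sketch below substitutes bag-of-words cosine similarity for the learned text-similarity model a real implementation would use; the function names are illustrative, not from the paper.

```python
from collections import Counter
import math

def cosine_bow(a, b):
    """Cosine similarity over bag-of-words token counts -- a crude
    stand-in for sentence-embedding similarity."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def tit_style_score(prompt, caption_of_generated_image):
    # In the actual pipeline the caption is produced by an MLLM
    # describing the generated image; here it is supplied directly.
    return cosine_bow(prompt, caption_of_generated_image)
```

The appeal of this round-trip design is that it needs no image-text alignment model at all: if the generated image faithfully realizes a long prompt, a caption of that image should recover the prompt's content.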
arXiv Detail & Related papers (2025-10-03T13:25:16Z) - Why Settle for One? Text-to-ImageSet Generation and Evaluation [72.55708276046124]
Text-to-ImageSet (T2IS) generation aims to generate sets of images that meet various consistency requirements based on user instructions. We propose AutoT2IS, a training-free framework that maximally leverages pretrained Transformers' in-context capabilities to harmonize visual elements. Our method also demonstrates the ability to enable numerous underexplored real-world applications, confirming its substantial practical value.
arXiv Detail & Related papers (2025-06-29T15:01:16Z) - Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning [69.33115351856785]
We present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that T2I-PAL can boost recognition performance by 3.47% on average.
arXiv Detail & Related papers (2025-06-12T11:09:49Z) - OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation [23.05106664412349]
Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. OneIG-Bench is a benchmark framework for evaluating T2I models across multiple dimensions.
arXiv Detail & Related papers (2025-06-09T17:50:21Z) - IA-T2I: Internet-Augmented Text-to-Image Generation [13.765327654914199]
Current text-to-image (T2I) generation models achieve promising results, but they fail in scenarios where the knowledge implied in the text prompt is uncertain. We propose an Internet-Augmented text-to-image generation (IA-T2I) framework that makes T2I models aware of such uncertain knowledge by providing them with reference images.
arXiv Detail & Related papers (2025-05-21T17:31:49Z) - WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation [38.196609962452655]
WorldGenBench is a benchmark designed to evaluate T2I models' world-knowledge grounding and implicit inferential capabilities. We propose the Knowledge Checklist Score, a structured metric that measures how well generated images satisfy key semantic expectations. Our findings highlight the need for deeper understanding and inference capabilities in next-generation T2I systems.
arXiv Detail & Related papers (2025-05-02T17:59:06Z) - Towards Visual Text Grounding of Multimodal Large Language Model [88.0588924255417]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z) - WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation [26.61175134316007]
Text-to-image (T2I) models are capable of generating high-quality artistic creations and visual content. We propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation.
arXiv Detail & Related papers (2025-03-10T12:47:53Z) - OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation [59.53678957969471]
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge. OpenING is a benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. IntJudge is a judge model for evaluating open-ended multimodal generation methods.
arXiv Detail & Related papers (2024-11-27T16:39:04Z) - Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment [53.45813302866466]
We present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG evaluates responses at four levels of granularity: holistic, structural, block-level, and image-specific. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories.
arXiv Detail & Related papers (2024-11-26T07:55:57Z) - VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored to the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z) - JourneyDB: A Benchmark for Generative Image Understanding [89.02046606392382]
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
arXiv Detail & Related papers (2023-07-03T02:39:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.