Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation
- URL: http://arxiv.org/abs/2505.18730v1
- Date: Sat, 24 May 2025 14:56:09 GMT
- Title: Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation
- Authors: Wenchao Zhang, Jiahe Tian, Runze He, Jizhong Han, Jiao Dai, Miaomiao Feng, Wei Mi, Xiaodan Zhang
- Abstract summary: We introduce Align Beyond Prompts (ABP), a benchmark to measure alignment of generated images with real-world knowledge beyond prompts. ABP comprises over 2,000 meticulously crafted prompts, covering real-world knowledge across six distinct scenarios. ABPScore is a metric that utilizes existing Multimodal Large Language Models (MLLMs) to assess the alignment between generated images and world knowledge beyond prompts.
- Score: 10.583920883457635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent text-to-image (T2I) generation models have advanced significantly, enabling the creation of high-fidelity images from textual prompts. However, existing evaluation benchmarks primarily focus on the explicit alignment between generated images and prompts, neglecting the alignment with real-world knowledge beyond prompts. To address this gap, we introduce Align Beyond Prompts (ABP), a comprehensive benchmark designed to measure the alignment of generated images with real-world knowledge that extends beyond the explicit user prompts. ABP comprises over 2,000 meticulously crafted prompts, covering real-world knowledge across six distinct scenarios. We further introduce ABPScore, a metric that utilizes existing Multimodal Large Language Models (MLLMs) to assess the alignment between generated images and world knowledge beyond prompts, which demonstrates strong correlations with human judgments. Through a comprehensive evaluation of 8 popular T2I models using ABP, we find that even state-of-the-art models, such as GPT-4o, face limitations in integrating simple real-world knowledge into generated images. To mitigate this issue, we introduce a training-free strategy within ABP, named Inference-Time Knowledge Injection (ITKI). By applying this strategy to optimize 200 challenging samples, we achieved an improvement of approximately 43% in ABPScore. The dataset and code are available at https://github.com/smile365317/ABP.
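As a hedged illustration of the kind of aggregation an MLLM-judged metric such as ABPScore implies: an MLLM is asked whether each world-knowledge criterion holds in the generated image, and the binary verdicts are combined into a score. The exact prompting and scoring protocol is defined in the paper, not here; the function name and the criterion keys below are hypothetical.

```python
def abp_style_score(judgments):
    """Aggregate binary MLLM verdicts into a 0-1 alignment score.

    `judgments` maps each world-knowledge criterion (hypothetical
    names) to the MLLM's yes/no verdict for the generated image.
    This averaging is a simplification of the paper's metric.
    """
    if not judgments:
        raise ValueError("need at least one criterion")
    return sum(judgments.values()) / len(judgments)

# Example: an image of ice cream in direct sunlight should show
# melting; two of three knowledge checks pass here.
score = abp_style_score({
    "ice_cream_melts_in_sun": True,
    "shadow_matches_light_direction": False,
    "reflection_present_on_water": True,
})
```

A real pipeline would obtain each verdict by querying an MLLM with the generated image and a criterion-specific question; only the aggregation step is shown.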
Related papers
- Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models [15.983959465314749]
We introduce PicWorld, the first comprehensive benchmark that assesses the grasp of implicit world knowledge and physical causal reasoning of T2I models. This benchmark consists of 1,100 prompts across three core categories. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, illustrating that they universally exhibit a fundamental limitation in their capacity for implicit world knowledge and physical causal reasoning to varying degrees.
arXiv Detail & Related papers (2025-11-23T03:44:54Z) - TIT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency [81.17906057429329]
LPG-Bench is a comprehensive benchmark for evaluating long-prompt-based text-to-image generation. We generate 2,600 images from 13 state-of-the-art models and perform comprehensive human-ranked annotations. We introduce a novel zero-shot metric based on text-to-image-to-text consistency, termed TIT, for evaluating long-prompt-generated images.
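The text-to-image-to-text consistency idea behind TIT can be sketched as follows: caption the generated image (in the real pipeline, with an MLLM), then measure how similar that caption is to the original prompt. The sketch below substitutes bag-of-words cosine similarity for the learned text-similarity model a real implementation would use; the function names are illustrative, not from the paper.

```python
from collections import Counter
import math

def cosine_bow(a, b):
    """Cosine similarity over bag-of-words token counts -- a crude
    stand-in for sentence-embedding similarity."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def tit_style_score(prompt, caption_of_generated_image):
    # In the actual pipeline the caption is produced by an MLLM
    # describing the generated image; here it is supplied directly.
    return cosine_bow(prompt, caption_of_generated_image)
```

The appeal of this round-trip design is that it needs no image-text alignment model at all: if the generated image faithfully realizes a long prompt, a caption of that image should recover the prompt's content.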
arXiv Detail & Related papers (2025-10-03T13:25:16Z) - Why Settle for One? Text-to-ImageSet Generation and Evaluation [72.55708276046124]
Text-to-ImageSet (T2IS) generation aims to generate sets of images that meet various consistency requirements based on user instructions. We propose AutoT2IS, a training-free framework that maximally leverages pretrained Transformers' in-context capabilities to harmonize visual elements. Our method also demonstrates the ability to enable numerous underexplored real-world applications, confirming its substantial practical value.
arXiv Detail & Related papers (2025-06-29T15:01:16Z) - Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning [69.33115351856785]
We present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that T2I-PAL can boost recognition performance by 3.47% on average.
arXiv Detail & Related papers (2025-06-12T11:09:49Z) - OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation [23.05106664412349]
Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. OneIG-Bench is a benchmark framework for evaluating T2I models across multiple dimensions.
arXiv Detail & Related papers (2025-06-09T17:50:21Z) - IA-T2I: Internet-Augmented Text-to-Image Generation [13.765327654914199]
Current text-to-image (T2I) generation models achieve promising results, but they fail in scenarios where the knowledge implied in the text prompt is uncertain. We propose an Internet-Augmented text-to-image generation (IA-T2I) framework that makes T2I models aware of such uncertain knowledge by providing them with reference images.
arXiv Detail & Related papers (2025-05-21T17:31:49Z) - WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation [38.196609962452655]
WorldGenBench is a benchmark designed to evaluate T2I models' world-knowledge grounding and implicit inferential capabilities. We propose the Knowledge Checklist Score, a structured metric that measures how well generated images satisfy key semantic expectations. Our findings highlight the need for deeper understanding and inference capabilities in next-generation T2I systems.
arXiv Detail & Related papers (2025-05-02T17:59:06Z) - Towards Visual Text Grounding of Multimodal Large Language Model [88.0588924255417]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z) - WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation [26.61175134316007]
Text-to-image (T2I) models are capable of generating high-quality artistic creations and visual content. We propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation.
arXiv Detail & Related papers (2025-03-10T12:47:53Z) - OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation [59.53678957969471]
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge. OpenING is a benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. IntJudge is a judge model for evaluating open-ended multimodal generation methods.
arXiv Detail & Related papers (2024-11-27T16:39:04Z) - Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment [53.45813302866466]
We present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG evaluates responses at four levels of granularity: holistic, structural, block-level, and image-specific. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories.
arXiv Detail & Related papers (2024-11-26T07:55:57Z) - VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored to the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z) - JourneyDB: A Benchmark for Generative Image Understanding [89.02046606392382]
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
arXiv Detail & Related papers (2023-07-03T02:39:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.