Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models
- URL: http://arxiv.org/abs/2511.18271v1
- Date: Sun, 23 Nov 2025 03:44:54 GMT
- Title: Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models
- Authors: Tianyang Han, Junhao Su, Junjie Hu, Peizhen Yang, Hengyu Shi, Junfeng Luo, Jialin Gao,
- Abstract summary: We introduce PicWorld, the first comprehensive benchmark that assesses the grasp of implicit world knowledge and physical causal reasoning of T2I models. This benchmark consists of 1,100 prompts across three core categories. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, illustrating that they universally exhibit a fundamental limitation in their capacity for implicit world knowledge and physical causal reasoning to varying degrees.
- Score: 15.983959465314749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image (T2I) models today are capable of producing photorealistic, instruction-following images, yet they still frequently fail on prompts that require implicit world knowledge. Existing evaluation protocols either emphasize compositional alignment or rely on single-round VQA-based scoring, leaving critical dimensions such as knowledge grounding, multi-physics interactions, and auditable evidence substantially undertested. To address these limitations, we introduce PicWorld, the first comprehensive benchmark that assesses the grasp of implicit world knowledge and physical causal reasoning of T2I models. This benchmark consists of 1,100 prompts across three core categories. To facilitate fine-grained evaluation, we propose PW-Agent, an evidence-grounded multi-agent evaluator that hierarchically assesses images on their physical realism and logical consistency by decomposing prompts into verifiable visual evidence. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, illustrating that they universally exhibit a fundamental limitation in their capacity for implicit world knowledge and physical causal reasoning to varying degrees. The findings highlight the need for reasoning-aware, knowledge-integrative architectures in future T2I systems.
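The abstract describes PW-Agent as decomposing a prompt into verifiable visual evidence and scoring images hierarchically. The paper does not spell out the aggregation, so the following is only a minimal sketch of how such evidence-grounded, per-level scoring could be aggregated; the `EvidenceCheck` structure, the two level names, and the uniform level weighting are all assumptions of this sketch, not the authors' implementation (which uses VLM-based judge agents to produce the verdicts).

```python
from dataclasses import dataclass

@dataclass
class EvidenceCheck:
    description: str  # e.g. "shadow direction is consistent with the light source"
    level: str        # hypothetical level names: "physical_realism" or "logical_consistency"
    passed: bool      # verdict a VLM judge agent would return for this check

def pw_agent_score(checks):
    """Aggregate per-evidence verdicts into per-level and overall scores.

    Each level is scored as the fraction of its evidence checks that
    passed; the overall score is the unweighted mean of the level
    scores (uniform weighting is an assumption of this sketch).
    """
    by_level = {}
    for c in checks:
        by_level.setdefault(c.level, []).append(c.passed)
    level_scores = {lvl: sum(v) / len(v) for lvl, v in by_level.items()}
    overall = sum(level_scores.values()) / len(level_scores)
    return level_scores, overall
```

The key design point the abstract emphasizes is auditability: because every score traces back to a named, checkable piece of visual evidence, a failed prompt can be explained by listing exactly which checks failed.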
Related papers
- AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models [58.85362281293525]
We introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We build upon this by developing a training-free knowledge distillation technique utilizing Large Language Models to address this limitation.
arXiv Detail & Related papers (2025-09-19T16:41:39Z) - Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play? [63.66192651248858]
T2I-CoReBench is a comprehensive and complex benchmark that evaluates both the composition and reasoning capabilities of T2I models. To increase complexity, reflecting inherent real-world complexity, we curate each prompt with higher compositional density. In total, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions.
arXiv Detail & Related papers (2025-09-03T17:58:12Z) - Why Settle for One? Text-to-ImageSet Generation and Evaluation [72.55708276046124]
Text-to-ImageSet (T2IS) generation aims to generate sets of images that meet various consistency requirements based on user instructions. We propose AutoT2IS, a training-free framework that maximally leverages pretrained Transformers' in-context capabilities to harmonize visual elements. Our method also demonstrates the ability to enable numerous underexplored real-world applications, confirming its substantial practical value.
arXiv Detail & Related papers (2025-06-29T15:01:16Z) - OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation [23.05106664412349]
Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. OneIG-Bench is a benchmark framework for evaluating T2I models across multiple dimensions.
arXiv Detail & Related papers (2025-06-09T17:50:21Z) - R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation [26.816674696050413]
Reasoning is a fundamental capability often required in real-world text-to-image (T2I) generation. Recent T2I models have made impressive progress in producing photorealistic images, but their reasoning capability remains underdeveloped. We introduce R2I-Bench, a benchmark specifically designed to rigorously assess reasoning-driven T2I generation.
arXiv Detail & Related papers (2025-05-29T14:43:46Z) - Replace in Translation: Boost Concept Alignment in Counterfactual Text-to-Image [53.09546752700792]
We propose a strategy to instruct this replacing process, called Explicit Logical Narrative Prompt (ELNP). We design a metric to calculate how many of the required concepts in a prompt are covered, on average, in the synthesized images. Extensive experiments and qualitative comparisons demonstrate that our strategy boosts concept alignment in counterfactual T2I.
arXiv Detail & Related papers (2025-05-20T13:27:52Z) - WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation [38.196609962452655]
WorldGenBench is a benchmark designed to evaluate T2I models' world knowledge grounding and implicit inferential capabilities. We propose the Knowledge Checklist Score, a structured metric that measures how well generated images satisfy key semantic expectations. Our findings highlight the need for deeper understanding and inference capabilities in next-generation T2I systems.
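The abstract describes the Knowledge Checklist Score as measuring how well generated images satisfy key semantic expectations. A natural reading, sketched below under the assumption that each prompt carries a checklist of binary expectations judged against the generated image, is a per-prompt pass fraction averaged over all prompts; the exact definition in the paper may differ.

```python
def knowledge_checklist_score(per_prompt_items):
    """Sketch of a checklist-style score (assumed form, not the paper's spec).

    per_prompt_items: one list of booleans per prompt, where each boolean
    records whether a checklist expectation about the generated image was
    judged satisfied.  Returns the mean per-prompt pass fraction in [0, 1].
    """
    fractions = [sum(items) / len(items) for items in per_prompt_items]
    return sum(fractions) / len(fractions)
```

Averaging per prompt first (rather than pooling all items) keeps prompts with long checklists from dominating the aggregate score.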
arXiv Detail & Related papers (2025-05-02T17:59:06Z) - WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation [25.010417955441653]
Text-to-image (T2I) models are capable of generating high-quality artistic creations and visual content. We propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation.
arXiv Detail & Related papers (2025-03-10T12:47:53Z) - PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models [50.33699462106502]
Text-to-image (T2I) models frequently fail to produce images consistent with physical commonsense.
Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal knowledge.
We introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties.
arXiv Detail & Related papers (2024-06-17T17:49:01Z) - Benchmarking Spatial Relationships in Text-to-Image Generation [102.62422723894232]
We investigate the ability of text-to-image models to generate correct spatial relationships among objects.
We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.
Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
arXiv Detail & Related papers (2022-12-20T06:03:51Z)
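The entry above describes VISOR as capturing how accurately a spatial relationship stated in the text is realized in the generated image. A minimal sketch of a VISOR-style computation is below, assuming an object detector has already produced, per image, whether both named objects are present and whether the stated relation holds; the conditional variant, restricted to images where both objects were actually generated, separates relation errors from object-omission errors (consistent with the paper's finding that models often fail to generate multiple objects at all).

```python
def visor(detections):
    """Sketch of a VISOR-style spatial-relation metric (assumed form).

    detections: per generated image, a tuple
    (both_objects_detected, relation_correct) derived from an object
    detector applied to the image.

    Returns (unconditional, conditional):
      unconditional - fraction of all images with the correct relation;
      conditional   - same fraction, restricted to images in which both
                      named objects were actually generated.
    """
    uncond = sum(rel for _, rel in detections) / len(detections)
    both = [rel for present, rel in detections if present]
    cond = sum(both) / len(both) if both else 0.0
    return uncond, cond
```

A large gap between the conditional and unconditional scores indicates that the dominant failure mode is missing objects rather than misplaced ones.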
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.