World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge
- URL: http://arxiv.org/abs/2510.04201v1
- Date: Sun, 05 Oct 2025 13:35:30 GMT
- Title: World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge
- Authors: Moo Hyun Son, Jintaek Oh, Sun Bin Mun, Jaechul Roh, Sehyun Choi,
- Abstract summary: We introduce World-To-Image, a novel framework that bridges the gap by empowering T2I generation with agent-driven world knowledge. We design an agent that dynamically searches the web to retrieve images for concepts unknown to the base model. This information is then used to perform multimodal prompt optimization, steering powerful generative backbones toward accurate synthesis.
- Score: 2.595803115566975
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: While text-to-image (T2I) models can synthesize high-quality images, their performance degrades significantly when prompted with novel or out-of-distribution (OOD) entities due to inherent knowledge cutoffs. We introduce World-To-Image, a novel framework that bridges this gap by empowering T2I generation with agent-driven world knowledge. We design an agent that dynamically searches the web to retrieve images for concepts unknown to the base model. This information is then used to perform multimodal prompt optimization, steering powerful generative backbones toward accurate synthesis. Critically, our evaluation goes beyond traditional metrics, utilizing modern assessments like LLMGrader and ImageReward to measure true semantic fidelity. Our experiments show that World-To-Image substantially outperforms state-of-the-art methods in both semantic alignment and visual aesthetics, achieving a +8.1% improvement in accuracy-to-prompt on our curated NICE benchmark. Our framework achieves these results with high efficiency in fewer than three iterations, paving the way for T2I systems that can better reflect the ever-changing real world. Our demo code is available at https://github.com/mhson-kyle/World-To-Image.
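The agentic loop the abstract describes (detect concepts the base model does not know, retrieve web references, fold them into the prompt, and stop within a few iterations) can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: every function body is a stub, and all names (`find_unknown_concepts`, `retrieve_reference_images`, `optimize_prompt`, `world_to_image`) are assumptions for exposition.

```python
MAX_ITERATIONS = 3  # the paper reports convergence in fewer than three iterations


def find_unknown_concepts(prompt, known_vocabulary):
    """Identify tokens the base T2I model likely does not know (stub)."""
    return [w for w in prompt.lower().split() if w not in known_vocabulary]


def retrieve_reference_images(concept):
    """Stand-in for the agent's web search; returns reference-image IDs (stub)."""
    return [f"web_image_for_{concept}"]


def optimize_prompt(prompt, references):
    """Stand-in for multimodal prompt optimization: fold retrieved context in."""
    if not references:
        return prompt
    return prompt + " [refs: " + ", ".join(references) + "]"


def world_to_image(prompt, known_vocabulary, score_fn, threshold=0.9):
    """Iterate retrieve -> optimize -> score, stopping early on a good score.

    score_fn stands in for an evaluator such as ImageReward or LLMGrader.
    """
    current = prompt
    for _ in range(MAX_ITERATIONS):
        unknown = find_unknown_concepts(current, known_vocabulary)
        refs = [img for c in unknown for img in retrieve_reference_images(c)]
        current = optimize_prompt(current, refs)
        if score_fn(current) >= threshold:
            break
    return current
```

In a real system the stubs would call a web-search agent, a multimodal prompt optimizer, and the generative backbone; the loop structure and early-exit criterion are the part conveyed by the abstract.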
Related papers
- Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models [15.983959465314749]
We introduce PicWorld, the first comprehensive benchmark that assesses the grasp of implicit world knowledge and physical causal reasoning of T2I models. This benchmark consists of 1,100 prompts across three core categories. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, illustrating that they universally exhibit a fundamental limitation in their capacity for implicit world knowledge and physical causal reasoning to varying degrees.
arXiv Detail & Related papers (2025-11-23T03:44:54Z)
- Improving Text-to-Image Generation with Input-Side Inference-Time Scaling
We propose a prompt rewriting framework that leverages large language models to refine user inputs before feeding them into T2I backbones. Results show that our prompt rewriter consistently improves image-text alignment, visual quality, and aesthetics, outperforming strong baselines. These findings highlight that prompt rewriting is an effective, scalable, and practical model-agnostic strategy for improving T2I systems.
arXiv Detail & Related papers (2025-10-14T00:51:39Z)
- Interleaving Reasoning for Better Text-to-Image Generation
We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals. Experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN.
arXiv Detail & Related papers (2025-09-08T17:56:23Z)
- Replace in Translation: Boost Concept Alignment in Counterfactual Text-to-Image
We propose a strategy to instruct this replacing process, called Explicit Logical Narrative Prompt (ELNP). We design a metric to calculate how many of the required concepts in the prompt are covered on average in the synthesized images. Extensive experiments and qualitative comparisons demonstrate that our strategy can boost concept alignment in counterfactual T2I.
arXiv Detail & Related papers (2025-05-20T13:27:52Z)
- WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation
WorldGenBench is a benchmark designed to evaluate T2I models' world-knowledge grounding and implicit inferential capabilities. We propose the Knowledge Checklist Score, a structured metric that measures how well generated images satisfy key semantic expectations. Our findings highlight the need for deeper understanding and inference capabilities in next-generation T2I systems.
arXiv Detail & Related papers (2025-05-02T17:59:06Z)
- Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
Lumina-Image 2.0 is a text-to-image generation framework that achieves significant progress compared to previous work. It adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence. We introduce a unified captioning system, Unified Captioner (UniCap), specifically designed for T2I generation tasks.
arXiv Detail & Related papers (2025-03-27T17:57:07Z)
- CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI
Current efforts to distinguish between real and AI-generated images may lack generalization. We propose a novel framework, Co-Spy, that first enhances existing semantic features. We also create Co-Spy-Bench, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models.
arXiv Detail & Related papers (2025-03-24T01:59:29Z)
- TIPS: Text-Image Pretraining with Spatial awareness
Self-supervised image-only pretraining is still the go-to method for many vision applications. We propose a novel general-purpose image-text model, which can be effectively used off the shelf for dense and global vision tasks.
arXiv Detail & Related papers (2024-10-21T21:05:04Z)
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
arXiv Detail & Related papers (2022-09-29T00:57:28Z)
- Txt2Img-MHN: Remote Sensing Image Generation from Text Using Modern Hopfield Networks [20.856451960761948]
We propose a novel text-to-image modern Hopfield network (Txt2Img-MHN) to generate realistic remote sensing images.
To better evaluate the realism and semantic consistency of the generated images, we conduct zero-shot classification on real remote sensing data.
Experiments on the benchmark remote sensing text-image dataset demonstrate that the proposed Txt2Img-MHN can generate more realistic remote sensing images.
arXiv Detail & Related papers (2022-08-08T22:02:10Z)
- A Shared Representation for Photorealistic Driving Simulators [83.5985178314263]
We propose to improve the quality of generated images by rethinking the discriminator architecture.
The focus is on the class of problems where images are generated given semantic inputs, such as scene segmentation maps or human body poses.
We aim to learn a shared latent representation that encodes enough information to jointly perform semantic segmentation, content reconstruction, and coarse-to-fine-grained adversarial reasoning.
arXiv Detail & Related papers (2021-12-09T18:59:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.