Test-time Prompt Refinement for Text-to-Image Models
- URL: http://arxiv.org/abs/2507.22076v1
- Date: Tue, 22 Jul 2025 20:30:13 GMT
- Title: Test-time Prompt Refinement for Text-to-Image Models
- Authors: Mohammad Abdul Hafeez Khan, Yash Jain, Siddhartha Bhattacharyya, Vibhav Vineet
- Abstract summary: We introduce TIR, a test-time prompt refinement framework that requires no additional training of the underlying T2I model. In our approach, each generation step is followed by a refinement step, where a pretrained multimodal large language model (MLLM) analyzes the output image and the user's prompt. We demonstrate that this closed-loop strategy improves alignment and visual coherence across multiple benchmark datasets, all while maintaining plug-and-play integration with black-box T2I models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image (T2I) generation models have made significant strides but still struggle with prompt sensitivity: even minor changes in prompt wording can yield inconsistent or inaccurate outputs. To address this challenge, we introduce TIR, a closed-loop, test-time prompt refinement framework that requires no additional training of the underlying T2I model. In our approach, each generation step is followed by a refinement step, where a pretrained multimodal large language model (MLLM) analyzes the output image and the user's prompt. The MLLM detects misalignments (e.g., missing objects, incorrect attributes) and produces a refined, physically grounded prompt for the next round of image generation. By iteratively refining the prompt and verifying alignment between the prompt and the image, TIR corrects errors, mirroring the iterative refinement process of human artists. We demonstrate that this closed-loop strategy improves alignment and visual coherence across multiple benchmark datasets, all while maintaining plug-and-play integration with black-box T2I models.
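The loop described in the abstract is concrete enough to sketch. Below is a minimal, illustrative rendering of a TIR-style closed loop; `generate_image` and `critique_alignment` are hypothetical placeholders for a black-box T2I endpoint and an MLLM critique call, not the paper's actual interface.

```python
# Minimal sketch of a TIR-style closed loop. All names are
# illustrative, not from the paper: a black-box T2I model generates,
# an MLLM critiques, and the refined prompt feeds the next round.

def generate_image(prompt: str):
    """Hypothetical wrapper around a black-box T2I model."""
    raise NotImplementedError

def critique_alignment(image, prompt: str) -> tuple[bool, str]:
    """Hypothetical MLLM call returning (is_aligned, refined_prompt).

    The MLLM is asked to list misalignments (missing objects,
    incorrect attributes) and rewrite the prompt to correct them.
    """
    raise NotImplementedError

def tir_loop(user_prompt: str, max_rounds: int = 4):
    prompt = user_prompt
    image = generate_image(prompt)
    for _ in range(max_rounds):
        aligned, refined = critique_alignment(image, prompt)
        if aligned:                      # verification passed; stop early
            return image, prompt
        prompt = refined                 # grounded rewrite of the prompt
        image = generate_image(prompt)   # next generation round
    return image, prompt
```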
Related papers
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer. Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis. On GenAI-Bench, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
- Aligning Text to Image in Diffusion Models is Easier Than You Think [47.623236425067326]
SoftREPA is a lightweight contrastive fine-tuning strategy that leverages soft text tokens for representation alignment. Our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency.
arXiv Detail & Related papers (2025-03-11T10:14:22Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, with a dedicated pipeline for generating interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis [10.47359822447001]
We present an alternate paradigm for T2I synthesis, decomposing complex multi-step generation into three steps. Our approach derives its strength from being modular, training-free, and applicable over any combination of image generation and editing models.
arXiv Detail & Related papers (2024-12-08T22:29:56Z)
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present the ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting [18.708185548091716]
FRAP is a simple yet effective approach based on adaptively adjusting the per-token prompt weights (a minimal reweighting sketch appears after this list). We show that FRAP generates images with significantly higher prompt-image alignment for prompts from complex datasets. We also explore combining FRAP with a prompt-rewriting LLM to recover degraded prompt-image alignment.
arXiv Detail & Related papers (2024-08-21T15:30:35Z)
- Information Theoretic Text-to-Image Alignment [49.396917351264655]
Mutual information (MI) is used to guide model alignment. Our method uses self-supervised fine-tuning and relies on a point-wise MI estimation between prompts and images. Our analysis indicates that our method is superior to the state of the art, yet it only requires the pre-trained denoising network of the T2I model itself to estimate MI.
arXiv Detail & Related papers (2024-05-31T12:20:02Z)
- Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation [149.96612254604986]
PRISM is an algorithm that automatically produces human-interpretable and transferable prompts. Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine the candidate prompt distribution. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles, and images across multiple T2I models.
arXiv Detail & Related papers (2024-03-28T02:35:53Z)
- Improving Text-to-Image Consistency via Automatic Prompt Optimization [26.2587505265501]
We introduce a T2I optimization-by-prompting framework, OPT2I, to improve prompt-image consistency in T2I models.
Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score.
arXiv Detail & Related papers (2024-03-26T15:42:01Z)
- DivCon: Divide and Conquer for Progressive Text-to-Image Generation [0.0]
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements.
A layout is employed as an intermediary to bridge large language models and layout-based diffusion models.
We introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks.
arXiv Detail & Related papers (2024-03-11T03:24:44Z)
- If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system (a minimal best-of-N sketch appears after this list).
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
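The FRAP entry above hinges on reweighting the prompt tokens that condition generation. A minimal sketch of that conditioning hook follows, using a generic CLIP-style embedding shape; the adaptive weight-update rule, which is the core of FRAP, is elided, and all names are illustrative.

```python
import numpy as np

# FRAP-style per-token reweighting sketch (illustrative only): scale
# each prompt token's embedding by a weight before it conditions the
# T2I model's cross-attention.

def reweight_tokens(token_embs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """token_embs: (num_tokens, dim); weights: (num_tokens,)."""
    return token_embs * weights[:, None]

# A FRAP-like method would update `weights` at each denoising step to
# boost under-attended object/attribute tokens; here the hook is shown
# with uniform weights on a CLIP-style 77-token prompt embedding.
embs = np.random.randn(77, 768)
weights = np.ones(77)
conditioned = reweight_tokens(embs, weights)
```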
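The selection pipeline in the last entry amounts to best-of-N sampling under an automatic scorer. A minimal sketch, with `generate_image` and `alignment_score` as hypothetical stand-ins (the paper's actual scoring system may differ):

```python
# Best-of-N selection sketch: sample several candidates for one prompt
# and keep the highest-scoring one. Both helpers are illustrative
# placeholders, not the paper's API.

def generate_image(prompt: str):
    """Hypothetical T2I call returning one sampled image."""
    raise NotImplementedError

def alignment_score(image, prompt: str) -> float:
    """Hypothetical prompt-image scorer (e.g., CLIP similarity or VQA)."""
    raise NotImplementedError

def best_of_n(prompt: str, n: int = 8):
    candidates = [generate_image(prompt) for _ in range(n)]
    return max(candidates, key=lambda img: alignment_score(img, prompt))
```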