No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation
- URL: http://arxiv.org/abs/2509.23457v1
- Date: Sat, 27 Sep 2025 18:59:49 GMT
- Title: No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation
- Authors: Mohammad Hossein Sameti, Amir M. Mansourian, Arash Marioriyad, Soheil Fadaee Oshyani, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
- Abstract summary: We propose a fine-grained test-time optimization framework that enhances compositional faithfulness in text-to-image (T2I) generation. Our method decomposes the input prompt into semantic concepts and evaluates alignment at both the global and concept levels. Experiments on DrawBench and CompBench prompts demonstrate that our method significantly improves concept coverage and human-judged faithfulness.
- Score: 14.417173544864298
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite recent advances in text-to-image (T2I) models, they often fail to faithfully render all elements of complex prompts, frequently omitting or misrepresenting specific objects and attributes. Test-time optimization has emerged as a promising approach to address this limitation by refining generation without the need for retraining. In this paper, we propose a fine-grained test-time optimization framework that enhances compositional faithfulness in T2I generation. Unlike most prior approaches, which rely solely on a global image/text similarity score, our method decomposes the input prompt into semantic concepts and evaluates alignment at both the global and concept levels. A fine-grained variant of CLIP is used to compute concept-level correspondence, producing detailed feedback on missing or inaccurate concepts. This feedback is fed into an iterative prompt refinement loop, enabling a large language model to propose improved prompts. Experiments on DrawBench and CompBench prompts demonstrate that our method significantly improves concept coverage and human-judged faithfulness over both standard test-time optimization and the base T2I model. Code is available at: https://github.com/AmirMansurian/NoConceptLeftBehind
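The refinement loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`refine`, `Result`, the `toy_*` helpers) is hypothetical, the "image" is a toy set of words, and the scorers and rewriter are stand-ins for the fine-grained CLIP variant and the large language model. Adding the global score to the mean concept score, and the 0.5 threshold, are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set

Image = Set[str]  # toy stand-in: an "image" is the set of words it depicts


@dataclass
class Result:
    prompt: str         # best prompt found so far
    score: float        # combined global + concept score for that prompt
    missing: List[str]  # concepts still scored below the threshold


def refine(
    prompt: str,
    concepts: List[str],
    generate: Callable[[str], Image],
    global_score: Callable[[Image, str], float],
    concept_score: Callable[[Image, str], float],
    rewrite: Callable[[str, List[str]], str],
    steps: int = 3,
    threshold: float = 0.5,
) -> Optional[Result]:
    """Iteratively rewrite the prompt using concept-level feedback."""
    current, best = prompt, None
    for _ in range(steps):
        image = generate(current)
        per_concept = [concept_score(image, c) for c in concepts]
        missing = [c for c, s in zip(concepts, per_concept) if s < threshold]
        # Assumption: global and mean concept scores are simply added.
        score = global_score(image, prompt) + sum(per_concept) / max(len(per_concept), 1)
        if best is None or score > best.score:
            best = Result(current, score, missing)
        if not missing:
            break  # every concept is covered, so stop early
        current = rewrite(current, missing)  # feedback drives the next prompt
    return best


# Toy stand-ins so the sketch runs without any model.
def toy_generate(prompt: str) -> Image:
    # Pretend the generator only renders the first word and repeated words.
    words = prompt.lower().split()
    return {w for i, w in enumerate(words) if i == 0 or words.count(w) >= 2}


def toy_global_score(image: Image, prompt: str) -> float:
    # Fraction of the original prompt's words that appear in the image.
    words = prompt.lower().split()
    return sum(w in image for w in words) / len(words)


def toy_concept_score(image: Image, concept: str) -> float:
    return 1.0 if concept in image else 0.0


def toy_rewrite(prompt: str, missing: List[str]) -> str:
    # Stand-in for the LLM: repeat the missing concepts to emphasise them.
    return prompt + " " + " ".join(missing)


result = refine(
    "red cube blue sphere", ["cube", "sphere"],
    toy_generate, toy_global_score, toy_concept_score, toy_rewrite,
)
print(result)
```

In this toy run the first "image" omits both concepts, the rewriter repeats them, and the second iteration covers both, so the loop stops early. A real setup would replace the stand-ins with a T2I model, an image/text scorer, and an LLM call.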
Related papers
- Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation [63.042451267669485]
We propose Prompt Redesign for Inference-time Scaling, a framework that adaptively revises the prompt during inference in response to scaled visual generations. We introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level. Experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2025-12-03T07:54:05Z)
- RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling [59.088798018184235]
RAPO++ is a cross-stage prompt optimization framework. It unifies training-data-aligned refinement, test-time iterative scaling, and large language model fine-tuning. RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility.
arXiv Detail & Related papers (2025-10-23T04:45:09Z)
- Improving Text-to-Image Generation with Input-Side Inference-Time Scaling [47.94598818606364]
We propose a prompt rewriting framework that leverages large language models to refine user inputs before feeding them into T2I backbones. Results show that our prompt rewriter consistently improves image-text alignment, visual quality, and aesthetics, outperforming strong baselines. These findings highlight that prompt rewriting is an effective, scalable, and practical model-agnostic strategy for improving T2I systems.
arXiv Detail & Related papers (2025-10-14T00:51:39Z)
- ConceptMix++: Leveling the Playing Field in Text-to-Image Benchmarking via Iterative Prompt Optimization [20.935028961216325]
ConceptMix++ is a framework that disentangles prompt phrasing from visual generation capabilities. We show that optimized prompts significantly improve compositional generation performance. These findings demonstrate that rigid benchmarking approaches may significantly underrepresent true model capabilities.
arXiv Detail & Related papers (2025-07-04T03:27:04Z)
- RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning [88.14234949860105]
RePrompt is a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Our approach enables end-to-end training without human-annotated data.
arXiv Detail & Related papers (2025-05-23T06:44:26Z)
- Replace in Translation: Boost Concept Alignment in Counterfactual Text-to-Image [53.09546752700792]
We propose a strategy to instruct this replacement process, called Explicit Logical Narrative Prompt (ELNP). We design a metric that measures how many of the concepts required by the prompt are covered, on average, in the synthesized images. Extensive experiments and qualitative comparisons demonstrate that our strategy can boost concept alignment in counterfactual T2I.
arXiv Detail & Related papers (2025-05-20T13:27:52Z)
- Fast Prompt Alignment for Text-to-Image Generation [28.66112701912297]
This paper introduces Fast Prompt Alignment (FPA), a prompt optimization framework that leverages a one-pass approach. FPA uses large language models (LLMs) for single-iteration prompt paraphrasing, followed by fine-tuning or in-context learning with optimized prompts. FPA achieves competitive text-image alignment scores at a fraction of the processing time.
arXiv Detail & Related papers (2024-12-11T18:58:41Z)
- FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting [18.708185548091716]
FRAP is a simple yet effective approach based on adaptively adjusting the per-token prompt weights. We show that FRAP generates images with significantly higher prompt-image alignment on prompts from complex datasets. We also explore combining FRAP with a prompt-rewriting LLM to recover its degraded prompt-image alignment.
arXiv Detail & Related papers (2024-08-21T15:30:35Z)
- T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation [55.16845189272573]
T2I-CompBench++ is an enhanced benchmark for compositional text-to-image generation. It comprises 8,000 compositional text prompts categorized into four primary groups: attribute binding, object relationships, generative numeracy, and complex compositions.
arXiv Detail & Related papers (2023-07-12T17:59:42Z)
- Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback [20.78162037954646]
We introduce a decompositional approach towards evaluation and improvement of text-to-image alignment.
Human user studies indicate that the proposed approach surpasses previous state-of-the-art by 8.7% in overall text-to-image alignment accuracy.
arXiv Detail & Related papers (2023-07-10T17:54:57Z)
- If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
- ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation [59.44301617306483]
We propose a learning-based encoder for fast and accurate customized text-to-image generation.
Our method enables high-fidelity inversion and more robust editability with a significantly faster encoding process.
arXiv Detail & Related papers (2023-02-27T14:49:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.