Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
- URL: http://arxiv.org/abs/2512.03534v1
- Date: Wed, 03 Dec 2025 07:54:05 GMT
- Title: Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
- Authors: Subin Kim, Sangwoo Mo, Mamshad Nayeem Rizve, Yiran Xu, Difan Liu, Jinwoo Shin, Tobias Hinz
- Abstract summary: We propose Prompt Redesign for Inference-time Scaling, a framework that adaptively revises the prompt during inference in response to scaled visual generations. We introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level. Experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference time. Visualizations are available at the website: https://subin-kim-cv.github.io/PRIS.
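The abstract's review–revise–regenerate loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`generate`, `verify_elements`, `redesign_prompt`), the averaging scheme, and the 0.5 failure threshold are all assumptions.

```python
def pris_loop(prompt, generate, verify_elements, redesign_prompt,
              rounds=3, samples_per_round=4):
    """Hypothetical sketch of PRIS-style joint prompt/visual scaling.

    generate(prompt) -> one visual sample.
    verify_elements(prompt, visual) -> dict of per-element alignment
        scores (the element-level factual correction idea).
    redesign_prompt(prompt, failures) -> revised prompt emphasizing
        elements that failed repeatedly.
    """
    best_visual, best_score = None, float("-inf")
    for _ in range(rounds):
        reports = []
        for _ in range(samples_per_round):
            visual = generate(prompt)
            scores = verify_elements(prompt, visual)
            total = sum(scores.values()) / len(scores)
            reports.append(scores)
            if total > best_score:
                best_visual, best_score = visual, total
        # A recurring failure pattern: an element scoring low on average
        # across the scaled generations of this round.
        failures = [e for e in reports[0]
                    if sum(r[e] for r in reports) / len(reports) < 0.5]
        if not failures:
            break  # all elements align well; stop revising the prompt
        prompt = redesign_prompt(prompt, failures)
    return best_visual, best_score
```

The key contrast with plain inference-time scaling is the outer loop: instead of only drawing more samples from a fixed prompt, the prompt itself is revised between rounds based on aggregated per-element feedback.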
Related papers
- Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing [76.2602505940467]
Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a "visual anchor" to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors.
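The localize–visualize–re-inspect cycle described above can be sketched as a simple fixed-point loop. All names here (`model.parse`, `model.localize`, `model.inspect_and_correct`, `render_anchors`) are hypothetical stand-ins, not the VSR paper's API.

```python
def visual_self_refine(chart_image, model, render_anchors, max_iters=3):
    """Hypothetical sketch of a VSR-style loop: parse the chart, draw
    pixel-level anchors, and feed the annotated image back for
    self-correction until the parse stops changing."""
    parse = model.parse(chart_image)                # initial chart parse
    for _ in range(max_iters):
        boxes = model.localize(chart_image, parse)  # pixel-level anchors
        annotated = render_anchors(chart_image, boxes)
        revised = model.inspect_and_correct(annotated, parse)
        if revised == parse:                        # converged: no corrections
            break
        parse = revised
    return parse
```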
arXiv Detail & Related papers (2026-02-18T13:40:53Z) - VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? [51.02924254085878]
Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs. We introduce VISTA-Bench, a benchmark spanning multimodal perception, reasoning, and unimodal understanding domains.
arXiv Detail & Related papers (2026-02-04T17:48:55Z) - Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models [50.87835332136393]
Chain-of-Thought (CoT) has largely improved the generation ability of unified models. In this paper, we introduce visual context consistency into the reasoning of unified models. We use supervised finetuning to teach the model how to plan visual checking, conduct self-reflection and self-refinement, and use flow-GRPO to further enhance visual consistency.
arXiv Detail & Related papers (2025-12-22T18:59:03Z) - No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation [14.417173544864298]
We propose a fine-grained test-time optimization framework that enhances compositional faithfulness in text-to-image (T2I) generation. Our method decomposes the input prompt into semantic concepts and evaluates alignment at both the global and concept levels. Experiments on DrawBench and CompBench prompts demonstrate that our method significantly improves concept coverage and human-judged faithfulness.
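The global-plus-concept-level scoring idea can be illustrated with a short sketch. The helpers `decompose` and `clip_score`, and the equal weighting of the global score against the worst concept, are assumptions for illustration only.

```python
def concept_score(prompt, image, decompose, clip_score):
    """Hypothetical sketch: score alignment both globally and per
    decomposed concept, so a missing concept cannot hide behind a
    high global score."""
    concepts = decompose(prompt)            # e.g. ["a red apple", "a blue vase"]
    global_s = clip_score(prompt, image)
    concept_s = {c: clip_score(c, image) for c in concepts}
    # Faithfulness requires every concept to appear, so penalize by the
    # weakest concept rather than the average.
    worst = min(concept_s.values()) if concept_s else global_s
    return 0.5 * global_s + 0.5 * worst, concept_s
```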
arXiv Detail & Related papers (2025-09-27T18:59:49Z) - Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment. We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment. Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
arXiv Detail & Related papers (2025-08-24T15:45:22Z) - TokBench: Evaluating Your Visual Tokenizer before Visual Generation [75.38270351179018]
We analyze text and face reconstruction quality across various scales for different image tokenizers and VAEs. Our results show modern visual tokenizers still struggle to preserve fine-grained features, especially at smaller scales.
arXiv Detail & Related papers (2025-05-23T17:52:16Z) - Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach [29.735863112700358]
We study the effectiveness of prompting a transformer-decoder with learned visual prompts for the generalized few-shot segmentation (GFSS) task.
Our goal is to achieve strong performance not only on novel categories with limited examples, but also to retain performance on base categories.
We introduce a unidirectional causal attention mechanism between the novel prompts, learned with limited examples, and the base prompts, learned with abundant data.
arXiv Detail & Related papers (2024-04-17T20:35:00Z) - LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models [28.983503845298824]
We show that synthetic text images are good visual prompts for vision-language models!
We propose our LoGoPrompt, which reformulates the classification objective to the visual prompt selection.
Our method consistently outperforms state-of-the-art methods in few-shot learning, base-to-new generalization, and domain generalization.
arXiv Detail & Related papers (2023-09-03T12:23:33Z) - Progressive Visual Prompt Learning with Contrastive Feature Re-formation [15.385630262368661]
We propose a new Progressive Visual Prompt (ProVP) structure to strengthen the interactions among prompts of different layers.
Our ProVP could effectively propagate the image embeddings to deep layers and behave partially similar to an instance adaptive prompt method.
To the best of our knowledge, we are the first to demonstrate the superior performance of visual prompts in V-L models to previous prompt-based methods in downstream tasks.
arXiv Detail & Related papers (2023-04-17T15:54:10Z) - Rethinking Visual Prompt Learning as Masked Visual Token Modeling [106.71983630652323]
We propose Visual Prompt learning as masked visual Token Modeling (VPTM) to transform the downstream visual classification into the pre-trained masked visual token prediction.
VPTM is the first visual prompt method on the generative pre-trained visual model, which achieves consistency between pre-training and downstream visual classification by task reformulation.
arXiv Detail & Related papers (2023-03-09T02:43:10Z) - Visually-augmented pretrained language models for NLP tasks without images [77.74849855049523]
Existing solutions often rely on explicit images for visual knowledge augmentation.
We propose a novel Visually-Augmented fine-tuning approach.
Our approach can consistently improve the performance of BERT, RoBERTa, BART, and T5 at different scales.
arXiv Detail & Related papers (2022-12-15T16:13:25Z)