Taming Text-to-Image Synthesis for Novices: User-centric Prompt Generation via Multi-turn Guidance
- URL: http://arxiv.org/abs/2408.12910v2
- Date: Tue, 14 Oct 2025 08:27:19 GMT
- Title: Taming Text-to-Image Synthesis for Novices: User-centric Prompt Generation via Multi-turn Guidance
- Authors: Yilun Liu, Minggui He, Feiyu Yao, Yuhe Ji, Shimin Tao, Jingzhou Du, Duan Li, Jian Gao, Li Zhang, Hao Yang, Boxing Chen, Osamu Yoshie
- Abstract summary: DialPrompt is a dialogue-based TIS prompt generation model that emphasizes user experience for novice users. To achieve this, we mined 15 essential dimensions for high-quality prompts from advanced users and curated a multi-turn dataset. Experiments indicate that DialPrompt improves significantly in user-centricity score compared with existing approaches.
- Score: 24.432762962671614
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The emergence of text-to-image synthesis (TIS) models has significantly influenced digital image creation by producing high-quality visuals from written descriptions. Yet these models are sensitive to textual prompts, posing a challenge for novice users who may not be familiar with TIS prompt writing. Existing solutions mitigate this via automatic prompt expansion or generation from a user query. However, this single-turn manner suffers from limited user-centricity in terms of result interpretability and user interactivity. Thus, we propose DialPrompt, a dialogue-based TIS prompt generation model that emphasizes user experience for novice users. DialPrompt is designed to follow a multi-turn workflow, where in each round of dialogue the model guides the user to express their preferences on possible optimization dimensions before generating the final TIS prompt. To achieve this, we mined 15 essential dimensions for high-quality prompts from advanced users and curated a multi-turn dataset. Through training on this dataset, DialPrompt improves user-centricity by allowing users to perceive and control the creation process of TIS prompts. Experiments indicate that DialPrompt achieves a significantly higher user-centricity score than existing approaches while maintaining a competitive quality of synthesized images. In our user evaluation, DialPrompt is highly rated by 19 human reviewers (especially novices).
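The multi-turn workflow described above can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the paper's actual model: the dimension names are an illustrative subset of the 15 mined dimensions, and `ask_user` stands in for a real dialogue turn.

```python
# Toy sketch of a DialPrompt-style multi-turn workflow: walk the user through
# optimization dimensions, collect preferences, then assemble the final prompt.
DIMENSIONS = ["style", "lighting", "composition"]  # illustrative subset of 15

def ask_user(dimension: str, answers: dict) -> str:
    """Stand-in for one dialogue turn; a real system would query the user."""
    return answers.get(dimension, "no preference")

def generate_prompt(user_query: str, answers: dict) -> str:
    """Guide the user dimension by dimension, then emit the final TIS prompt."""
    preferences = []
    for dim in DIMENSIONS:
        choice = ask_user(dim, answers)
        if choice != "no preference":
            preferences.append(f"{dim}: {choice}")
    return ", ".join([user_query] + preferences)

prompt = generate_prompt(
    "a cat on a windowsill",
    {"style": "watercolor", "lighting": "golden hour"},
)
print(prompt)  # a cat on a windowsill, style: watercolor, lighting: golden hour
```

The point of the structure is that each dimension is surfaced to the user before the final prompt is produced, which is what gives the user interpretability and control over the generation process.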
Related papers
- Adaptive Prompt Elicitation for Text-to-Image Generation [31.242444699785697]
APE represents latent intent as interpretable feature requirements using language model priors. A user study with challenging user-defined tasks demonstrates 19.8% higher alignment without workload overhead.
arXiv Detail & Related papers (2026-02-04T16:24:46Z) - PromptSculptor: Multi-Agent Based Text-to-Image Prompt Optimization [4.133498001057646]
To generate high-quality images, users must craft detailed prompts specifying scene, style, and context. We propose PromptSculptor, a novel multi-agent framework that automates this iterative prompt optimization process. Our system decomposes the task into four specialized agents that work collaboratively to transform a short, vague user prompt into a comprehensive, refined prompt.
arXiv Detail & Related papers (2025-09-15T20:52:11Z) - VisualPrompter: Prompt Optimization with Visual Feedback for Text-to-Image Synthesis [15.392482488365955]
VisualPrompter is a training-free prompt engineering framework that refines user inputs into model-preferred sentences. Our framework achieves new state-of-the-art performance on multiple benchmarks for text-image alignment evaluation.
arXiv Detail & Related papers (2025-06-29T08:24:39Z) - RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning [88.14234949860105]
RePrompt is a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Our approach enables end-to-end training without human-annotated data.
arXiv Detail & Related papers (2025-05-23T06:44:26Z) - Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation [55.42794740244581]
We propose a novel prompt optimization framework designed to rephrase a simple user prompt into a sophisticated prompt for a text-to-image model. Specifically, we employ large vision-language models (LVLMs) as the solver to rewrite the user prompt and, concurrently, employ LVLMs as a reward model to score the aesthetics and alignment of the images generated by the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback.
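The solver/reward loop described above can be sketched in a few lines. This is a hypothetical toy, not the paper's implementation: both roles are deterministic stubs, and the real system scores the generated images with an LVLM rather than scoring the prompt text.

```python
# Toy sketch of a self-rewarding prompt-optimization loop: one role rewrites
# the user prompt, another scores each candidate, and the best rewrite is kept.
DETAILS = ["highly detailed", "cinematic lighting", "8k render"]  # illustrative

def rewrite(prompt: str, k: int) -> str:
    """Stub solver: a real system would sample an LVLM rewrite."""
    return ", ".join([prompt] + DETAILS[: k + 1])

def reward(candidate: str) -> float:
    """Stub reward model: here, more specific candidates simply score higher."""
    return float(len(candidate.split(", ")))

def optimize(prompt: str, n_candidates: int = 3) -> str:
    """Sample candidate rewrites and keep the one the reward model prefers."""
    candidates = [rewrite(prompt, k) for k in range(n_candidates)]
    return max(candidates, key=reward)

best = optimize("a castle at dusk")
print(best)  # a castle at dusk, highly detailed, cinematic lighting, 8k render
```

The design choice that matters is that the reward signal comes from a model rather than from human annotators, which is what makes the loop trainable without labeled preference data.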
arXiv Detail & Related papers (2025-05-22T15:05:07Z) - Multi-agents based User Values Mining for Recommendation [52.26100802380767]
We propose a zero-shot multi-LLM collaborative framework for effective and accurate user value extraction. We apply text summarization techniques to condense item content while preserving essential meaning. To mitigate hallucinations, we introduce two specialized agent roles: evaluators and supervisors.
arXiv Detail & Related papers (2025-05-02T04:01:31Z) - Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias [52.590072198551944]
The aim of image personalization is to create images based on a user-provided subject.
Current methods face challenges in ensuring fidelity to the text prompt.
We introduce a novel training pipeline that incorporates an attractor to filter out distractions in training images.
arXiv Detail & Related papers (2025-03-09T14:14:02Z) - Enhancing Intent Understanding for Ambiguous prompt: A Human-Machine Co-Adaption Strategy [28.647935556492957]
We propose a human-machine co-adaption strategy using mutual information between the user's prompts and the pictures under modification.
We find that an improved model can reduce the necessity for multiple rounds of adjustments.
arXiv Detail & Related papers (2025-01-25T10:32:00Z) - Prompt Refinement with Image Pivot for Text-to-Image Generation [103.63292948223592]
We introduce Prompt Refinement with Image Pivot (PRIP) for text-to-image generation.
PRIP decomposes refinement process into two data-rich tasks: inferring representations of user-preferred images from user languages and translating image representations into system languages.
It substantially outperforms a wide range of baselines and effectively transfers to unseen systems in a zero-shot manner.
arXiv Detail & Related papers (2024-06-28T22:19:24Z) - Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; the preference prediction accuracy of VP-Score is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z) - CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems [60.27663010453209]
We leverage large language models (LLMs) to generate satisfaction-aware counterfactual dialogues.
We gather human annotations to ensure the reliability of the generated samples.
Our results shed light on the need for data augmentation approaches for user satisfaction estimation in TOD systems.
arXiv Detail & Related papers (2024-03-27T23:45:31Z) - PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement [12.55886762028225]
We propose PromptCharm, a system that facilitates text-to-image creation through multi-modal prompt engineering and refinement.
PromptCharm first automatically refines and optimizes the user's initial prompt.
It supports the user in exploring and selecting different image styles within a large database.
It renders model explanations by visualizing the model's attention values.
arXiv Detail & Related papers (2024-03-06T19:55:01Z) - A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis [33.71897211776133]
Well-designed prompts have demonstrated the potential to guide text-to-image models in generating amazing images.
It is challenging for novice users to achieve the desired results by manually entering prompts.
We propose a novel framework that automatically translates user-input prompts into model-preferred prompts.
arXiv Detail & Related papers (2024-02-20T06:58:49Z) - DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z) - RELIC: Investigating Large Language Model Responses using Self-Consistency [58.63436505595177]
Large Language Models (LLMs) are notorious for blending fact with fiction and generating non-factual content, known as hallucinations.
We propose an interactive system that helps users gain insight into the reliability of the generated text.
arXiv Detail & Related papers (2023-11-28T14:55:52Z) - The Chosen One: Consistent Characters in Text-to-Image Diffusion Models [71.15152184631951]
We propose a fully automated solution for consistent character generation with the sole input being a text prompt.
Our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods.
arXiv Detail & Related papers (2023-11-16T18:59:51Z) - Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting [13.252755478909899]
We propose a novel approach that involves rewriting user prompts based on a newly collected large-scale text-to-image dataset with over 300k prompts from 3115 users.
Our rewriting model enhances the expressiveness and alignment of user prompts with their intended visual outputs.
arXiv Detail & Related papers (2023-10-12T08:36:25Z) - PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation [16.41459454076984]
This research proposes PromptMagician, a visual analysis system that helps users explore the image results and refine the input prompts.
The backbone of our system is a prompt recommendation model that takes user prompts as input, retrieves similar prompt-image pairs from DiffusionDB, and identifies special (important and relevant) prompt keywords.
arXiv Detail & Related papers (2023-07-18T07:46:25Z) - Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models [29.057923932305123]
We present Promptify, an interactive system that supports prompt exploration and refinement for text-to-image generative models.
Our user study shows that Promptify effectively facilitates the text-to-image workflow and outperforms an existing baseline tool widely used for text-to-image generation.
arXiv Detail & Related papers (2023-04-18T22:59:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.