What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance
- URL: http://arxiv.org/abs/2408.12910v1
- Date: Fri, 23 Aug 2024 08:35:35 GMT
- Title: What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance
- Authors: Yilun Liu, Minggui He, Feiyu Yao, Yuhe Ji, Shimin Tao, Jingzhou Du, Duan Li, Jian Gao, Li Zhang, Hao Yang, Boxing Chen, Osamu Yoshie
- Abstract summary: Text-to-image synthesis (TIS) models heavily rely on the quality and specificity of textual prompts.
Existing solutions relieve this via automatic model-preferred prompt generation from user queries.
We propose DialPrompt, a multi-turn dialogue-based TIS prompt generation model that emphasises user-centricity.
- Score: 23.411806572667707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The emergence of text-to-image synthesis (TIS) models has significantly influenced digital image creation by producing high-quality visuals from written descriptions. Yet these models heavily rely on the quality and specificity of textual prompts, posing a challenge for novice users who may not be familiar with TIS-model-preferred prompt writing. Existing solutions relieve this via automatic model-preferred prompt generation from user queries. However, this single-turn manner suffers from limited user-centricity in terms of result interpretability and user interactivity. To address these issues, we propose DialPrompt, a multi-turn dialogue-based TIS prompt generation model that emphasises user-centricity. DialPrompt is designed to follow a multi-turn guidance workflow, where in each round of dialogue the model queries the user about their preferences on possible optimization dimensions before generating the final TIS prompt. To achieve this, we mined 15 essential dimensions for high-quality prompts from advanced users and curated a multi-turn dataset. Through training on this dataset, DialPrompt can improve interpretability by allowing users to understand the correlation between specific phrases and image attributes. Additionally, it enables greater user control and engagement in the prompt generation process, leading to more personalized and visually satisfying outputs. Experiments indicate that DialPrompt achieves competitive quality in synthesized images, outperforming existing prompt engineering approaches by 5.7%. Furthermore, in our user evaluation, DialPrompt outperforms existing approaches by 46.5% in user-centricity score and is rated 7.9/10 by 19 human reviewers.
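The guidance workflow described in the abstract is concrete enough to sketch. The following is a minimal, hypothetical illustration of a DialPrompt-style multi-turn loop, not the paper's actual interface: DIMENSIONS names only a few plausible examples of the 15 mined dimensions, and ask_user and the composition logic are stand-ins.

```python
# Minimal sketch of a DialPrompt-style multi-turn guidance loop.
# Hypothetical throughout: DIMENSIONS lists a few plausible examples of
# the paper's 15 mined dimensions, and ask_user is any callable that
# returns the user's reply (e.g., the built-in input).

DIMENSIONS = ["subject detail", "art style", "lighting", "composition", "color palette"]

def dialprompt_session(user_query: str, ask_user) -> str:
    """Each round queries the user's preference on one optimization
    dimension; the confirmed phrases are then composed into the final
    text-to-image prompt."""
    preferences = {}
    for dim in DIMENSIONS:
        # A real dialogue model would decide what to ask and when to stop;
        # this sketch simply walks the dimension list.
        answer = ask_user(f"For '{user_query}', any preference on {dim}? (or 'skip') ")
        if answer.strip().lower() != "skip":
            preferences[dim] = answer.strip()
    # Keeping each phrase tied to its dimension preserves the
    # phrase-to-attribute correlation the abstract credits for
    # improved interpretability.
    fragments = [user_query] + [f"{dim}: {val}" for dim, val in preferences.items()]
    return ", ".join(fragments)

if __name__ == "__main__":
    print("Final TIS prompt:", dialprompt_session("a cabin by a lake", input))
```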
Related papers
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; its preference prediction accuracy is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z)
- CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems [60.27663010453209]
We leverage large language models (LLMs) to generate satisfaction-aware counterfactual dialogues.
We gather human annotations to ensure the reliability of the generated samples.
Our results shed light on the need for data augmentation approaches for user satisfaction estimation in TOD systems.
arXiv Detail & Related papers (2024-03-27T23:45:31Z)
- PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement [12.55886762028225]
We propose PromptCharm, a system that facilitates text-to-image creation through multi-modal prompt engineering and refinement.
PromptCharm first automatically refines and optimizes the user's initial prompt.
It supports the user in exploring and selecting different image styles within a large database.
It renders model explanations by visualizing the model's attention values.
arXiv Detail & Related papers (2024-03-06T19:55:01Z)
- A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis [33.71897211776133]
Well-designed prompts have demonstrated the potential to guide text-to-image models in generating amazing images.
It is challenging for novice users to achieve the desired results by manually entering prompts.
We propose a novel framework that automatically translates user-input prompts into model-preferred prompts; a sketch of this single-turn rewrite pattern appears after this list.
arXiv Detail & Related papers (2024-02-20T06:58:49Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- RELIC: Investigating Large Language Model Responses using Self-Consistency [58.63436505595177]
Large Language Models (LLMs) are notorious for blending fact with fiction and generating non-factual content, known as hallucinations.
We propose an interactive system that helps users gain insight into the reliability of the generated text.
arXiv Detail & Related papers (2023-11-28T14:55:52Z)
- The Chosen One: Consistent Characters in Text-to-Image Diffusion Models [71.15152184631951]
We propose a fully automated solution for consistent character generation with the sole input being a text prompt.
Our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods.
arXiv Detail & Related papers (2023-11-16T18:59:51Z)
- Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting [13.252755478909899]
We propose a novel approach that involves rewriting user prompts based on a newly collected large-scale text-to-image dataset with over 300k prompts from 3115 users.
Our rewriting model enhances the expressiveness and alignment of user prompts with their intended visual outputs.
arXiv Detail & Related papers (2023-10-12T08:36:25Z)
- PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation [16.41459454076984]
This research proposes PromptMagician, a visual analysis system that helps users explore the image results and refine the input prompts.
The backbone of our system is a prompt recommendation model that takes user prompts as input, retrieves similar prompt-image pairs from DiffusionDB, and identifies special (important and relevant) prompt keywords.
arXiv Detail & Related papers (2023-07-18T07:46:25Z)
- Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models [29.057923932305123]
We present Promptify, an interactive system that supports prompt exploration and refinement for text-to-image generative models.
Our user study shows that Promptify effectively facilitates the text-to-image workflow and outperforms an existing baseline tool widely used for text-to-image generation.
arXiv Detail & Related papers (2023-04-18T22:59:11Z)
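Several entries above, including the user-friendly framework and Tailored Visions, share a single-turn rewrite pattern: a language model maps the raw user query directly to a model-preferred prompt with no clarifying questions, which is the interactivity gap DialPrompt targets. Below is a minimal sketch under stated assumptions: it assumes an OpenAI-compatible chat API, and the model name and instruction text are placeholders rather than details from any of the papers.

```python
# Minimal sketch of the single-turn prompt-rewriting pattern used by
# several papers above. Assumes an OpenAI-compatible chat API; the
# model name and instruction text are placeholders, not from the papers.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REWRITE_INSTRUCTION = (
    "Rewrite the user's image request into a detailed prompt for a "
    "text-to-image model. Add subject detail, style, lighting, and "
    "composition keywords. Return only the rewritten prompt."
)

def rewrite_prompt(user_query: str) -> str:
    """One-shot rewrite: user query in, model-preferred prompt out.
    Unlike a multi-turn loop, the user is never consulted about
    individual dimensions."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTION},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content.strip()

print(rewrite_prompt("a cabin by a lake"))
```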