Enhancing Intent Understanding for Ambiguous Prompts through Human-Machine Co-Adaptation
- URL: http://arxiv.org/abs/2501.15167v2
- Date: Sun, 16 Feb 2025 18:02:47 GMT
- Title: Enhancing Intent Understanding for Ambiguous Prompts through Human-Machine Co-Adaptation
- Authors: Yangfan He, Jianhui Wang, Yijin Wang, Kun Li, Li Sun, Jiayi Su, Jingyuan Lu, Jinhua Song, Haoyuan Li, Sida Li, Tianyu Shi, Miao Zhang
- Abstract summary: We propose a human-machine co-adaptation strategy that uses the mutual information between the user's prompts and the images under modification as the optimization target.
We find that the improved model reduces the need for multiple rounds of adjustment.
- Score: 20.954269395301885
- License:
- Abstract: Today's image generation systems can produce realistic, high-quality images. However, user prompts often contain ambiguities, making it difficult for these systems to interpret users' actual intentions. Consequently, many users must revise their prompts several times before the generated images meet their expectations. While some methods focus on enhancing prompts so that the generated images better fit user needs, models still struggle to understand users' real needs, especially those of non-expert users. In this research, we aim to enhance the visual parameter-tuning process, making the model user-friendly for individuals without specialized knowledge and better able to understand user needs. We propose a human-machine co-adaptation strategy that uses the mutual information between the user's prompts and the images under modification as the optimization target, so that the system better adapts to user needs. We find that the improved model reduces the need for multiple rounds of adjustment. We also collect a multi-round dialogue dataset of prompt-image pairs annotated with user intent. Various experiments demonstrate the effectiveness of the proposed method on our dataset. Our annotation tools and several examples from our dataset are available at https://zenodo.org/records/14876029 for easier review, and we will open-source the full dataset and code.
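For illustration, below is a minimal sketch (not the authors' released code) of one common way to turn "mutual information between the user's prompts and the images under modification" into a trainable objective: an InfoNCE-style lower bound on I(prompt; image) computed from paired prompt and image embeddings. The encoder choice, embedding dimension, and temperature are assumptions.

```python
# Hedged sketch: InfoNCE lower bound on mutual information between prompt and
# image embeddings, usable as an optimization target. Encoders are assumed to
# exist elsewhere; only the objective is shown here.
import torch
import torch.nn.functional as F


def info_nce_mi_lower_bound(prompt_emb: torch.Tensor,
                            image_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Estimate a lower bound on I(prompt; image) for a batch of matched pairs.

    prompt_emb, image_emb: (batch, dim) embeddings of prompt/image pairs.
    Returns a scalar loss; minimizing it tightens (raises) the MI lower bound.
    """
    prompt_emb = F.normalize(prompt_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    # Pairwise cosine similarities between every prompt and every image in the batch.
    logits = prompt_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: each prompt should best match its own image and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Toy usage with random tensors standing in for real encoder outputs:
prompt_emb = torch.randn(8, 512, requires_grad=True)
image_emb = torch.randn(8, 512, requires_grad=True)
loss = info_nce_mi_lower_bound(prompt_emb, image_emb)
loss.backward()  # in a real loop, gradients would flow back into the encoders
```

Minimizing this loss over matched prompt-image pairs encourages the in-progress image to remain maximally informative about the user's prompt, which is one plausible reading of the mutual-information objective described in the abstract.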
Related papers
- Personalized Image Generation with Large Multimodal Models [47.289887243367055]
We propose a Personalized Image Generation Framework named Pigeon to capture users' visual preferences and needs from noisy user history and multimodal instructions.
We apply Pigeon to personalized sticker and movie poster generation, where extensive quantitative results and human evaluation highlight its superiority over various generative baselines.
arXiv Detail & Related papers (2024-10-18T04:20:46Z)
- Reflective Human-Machine Co-adaptation for Enhanced Text-to-Image Generation Dialogue System [7.009995656535664]
We propose a reflective human-machine co-adaptation strategy, named RHM-CAS.
Externally, the Agent engages in meaningful language interactions with users to reflect on and refine the generated images.
Internally, the Agent optimizes its policy based on user preferences, ensuring that the final outcomes closely align with them.
arXiv Detail & Related papers (2024-08-27T18:08:00Z)
- What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance [23.411806572667707]
Text-to-image synthesis (TIS) models heavily rely on the quality and specificity of textual prompts.
Existing solutions ease this burden by automatically generating model-preferred prompts from user queries.
We propose DialPrompt, a multi-turn dialogue-based TIS prompt generation model that emphasises user-centricity.
arXiv Detail & Related papers (2024-08-23T08:35:35Z)
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation [49.997839600988875]
Existing personalization methods rely on finetuning a text-to-image foundation model on a user's custom dataset.
We propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model.
Our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
arXiv Detail & Related papers (2024-07-08T17:59:02Z)
- Prompt Refinement with Image Pivot for Text-to-Image Generation [103.63292948223592]
We introduce Prompt Refinement with Image Pivot (PRIP) for text-to-image generation.
PRIP decomposes the refinement process into two data-rich tasks: inferring representations of user-preferred images from user language, and translating those image representations into system language.
It substantially outperforms a wide range of baselines and effectively transfers to unseen systems in a zero-shot manner.
arXiv Detail & Related papers (2024-06-28T22:19:24Z)
- Empowering Visual Creativity: A Vision-Language Assistant to Image Editing Recommendations [109.65267337037842]
We introduce the task of Image Editing Recommendation (IER).
IER aims to automatically generate diverse creative editing instructions from an input image and a simple prompt representing the users' under-specified editing purpose.
We introduce Creativity-Vision Language Assistant (Creativity-VLA), a multimodal framework designed specifically for edit-instruction generation.
arXiv Detail & Related papers (2024-05-31T18:22:29Z)
- User-Friendly Customized Generation with Multi-Modal Prompts [21.873076466803145]
We propose a novel integration of text and images tailored to each customization concept.
Our proposed paradigm for customized text-to-image generation surpasses existing finetune-based methods in user-friendliness.
arXiv Detail & Related papers (2024-05-26T09:34:16Z)
- Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond [87.1712108247199]
Our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP).
We develop a generic and personalized generative framework that can handle a wide range of personalized needs.
Our methodology enhances the capabilities of foundational language models for personalized tasks.
arXiv Detail & Related papers (2024-03-15T20:21:31Z)
- Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting [13.252755478909899]
We propose a novel approach that involves rewriting user prompts based on a newly collected large-scale text-to-image dataset with over 300k prompts from 3115 users.
Our rewriting model enhances the expressiveness and alignment of user prompts with their intended visual outputs.
arXiv Detail & Related papers (2023-10-12T08:36:25Z)
- PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation [16.41459454076984]
This research proposes PromptMagician, a visual analysis system that helps users explore the image results and refine the input prompts.
The backbone of our system is a prompt recommendation model that takes user prompts as input, retrieves similar prompt-image pairs from DiffusionDB, and identifies special (important and relevant) prompt keywords.
arXiv Detail & Related papers (2023-07-18T07:46:25Z)
- CAISE: Conversational Agent for Image Search and Editing [109.57721903485663]
We propose a dataset for an automated Conversational Agent for Image Search and Editing (CAISE).
To our knowledge, this is the first dataset that provides conversational image search and editing annotations.
The functions that the assistant-annotators conduct with the tool are recorded as executable commands.
arXiv Detail & Related papers (2022-02-24T00:55:52Z)