Enhancing Intent Understanding for Ambiguous prompt: A Human-Machine Co-Adaption Strategy
- URL: http://arxiv.org/abs/2501.15167v5
- Date: Mon, 21 Apr 2025 05:35:25 GMT
- Title: Enhancing Intent Understanding for Ambiguous prompt: A Human-Machine Co-Adaption Strategy
- Authors: Yangfan He, Jianhui Wang, Yijin Wang, Kun Li, Yan Zhong, Xinyuan Song, Li Sun, Jingyuan Lu, Miao Zhang, Tianyu Shi, Xinhang Yuan, Kuan Lu, Menghao Huo, Keqin Li, Jiaqi Chen
- Abstract summary: We propose a human-machine co-adaption strategy using mutual information between the user's prompts and the pictures under modification. We find that an improved model can reduce the necessity for multiple rounds of adjustments.
- Score: 28.647935556492957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Today's image generation systems are capable of producing realistic and high-quality images. However, user prompts often contain ambiguities, making it difficult for these systems to interpret users' actual intentions. Consequently, many users must modify their prompts several times to ensure the generated images meet their expectations. While some methods focus on enhancing prompts so that the generated images fit user needs, models still struggle to understand users' real needs, especially those of non-expert users. In this research, we aim to enhance the visual parameter-tuning process, making the model user-friendly for individuals without specialized knowledge and better able to understand user needs. We propose a human-machine co-adaption strategy that uses the mutual information between the user's prompts and the pictures under modification as the optimization target, making the system better adapt to user needs. We find that an improved model can reduce the necessity for multiple rounds of adjustments. We also collect a multi-round dialogue dataset of prompt-image pairs annotated with user intent. Various experiments demonstrate the effectiveness of the proposed method on our proposed dataset. Our annotation tools and several examples of our dataset are available at https://zenodo.org/records/14876029 for easier review. We will open-source our full dataset and code.
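As a minimal illustration of the mutual-information objective described in the abstract (an assumed sketch, not the authors' released code), the dependence between a user's prompt and the image under modification can be lower-bounded with an InfoNCE-style estimator over a batch of paired prompt and image embeddings; the encoders and function names below are assumptions.

```python
# Hypothetical sketch: InfoNCE lower bound on I(prompt; image), usable as an
# optimization target or as a per-round score during multi-round refinement.
# The encoders producing the embeddings (e.g. a CLIP-style model) are assumed
# and are not the paper's specific implementation.
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(prompt_emb: torch.Tensor,
                           image_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Lower bound on mutual information for a batch of paired (prompt, image)
    embeddings, each of shape (batch, dim)."""
    p = F.normalize(prompt_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = p @ v.t() / temperature                   # pairwise similarities
    labels = torch.arange(p.size(0), device=p.device)  # matched pairs lie on the diagonal
    nce_loss = F.cross_entropy(logits, labels)
    # InfoNCE bound: I(prompt; image) >= log(batch_size) - L_NCE
    return torch.log(torch.tensor(float(p.size(0)))) - nce_loss
```

Under this reading, a higher bound for the current image indicates that it already carries more of the information in the prompt, which is consistent with the paper's claim that a better-adapted model reduces the number of adjustment rounds.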
Related papers
- Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Preference Understanding [29.191627597682597]
We present a framework incorporating human-in-the-loop feedback, leveraging a well-trained reward model aligned with user preferences.
Our approach consistently surpasses competing models in user satisfaction, especially in multi-turn dialogue scenarios.
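A rough sketch of the reward-guided, multi-round loop this summary describes (function names and signatures are hypothetical placeholders, not the paper's API):

```python
# Hypothetical best-of-n round: a preference-aligned reward model scores
# candidate generations and the top-scoring image is carried into the next
# dialogue round for further refinement.
from typing import Callable, List, Tuple

def best_of_n_round(prompt: str,
                    generate: Callable[[str], object],      # assumed text-to-image call
                    reward: Callable[[str, object], float],  # assumed reward model
                    n_candidates: int = 4) -> Tuple[object, float]:
    scored: List[Tuple[object, float]] = [
        (image, reward(prompt, image))
        for image in (generate(prompt) for _ in range(n_candidates))
    ]
    return max(scored, key=lambda pair: pair[1])
```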
arXiv Detail & Related papers (2025-04-25T09:35:02Z) - Personalized Image Generation with Large Multimodal Models [47.289887243367055]
We propose a Personalized Image Generation Framework named Pigeon to capture users' visual preferences and needs from noisy user history and multimodal instructions.
We apply Pigeon to personalized sticker and movie poster generation, where extensive quantitative results and human evaluation highlight its superiority over various generative baselines.
arXiv Detail & Related papers (2024-10-18T04:20:46Z) - Reflective Human-Machine Co-adaptation for Enhanced Text-to-Image Generation Dialogue System [7.009995656535664]
We propose a reflective human-machine co-adaptation strategy, named RHM-CAS.
Externally, the Agent engages in meaningful language interactions with users to reflect on and refine the generated images.
Internally, the Agent tries to optimize the policy based on user preferences, ensuring that the final outcomes closely align with them.
arXiv Detail & Related papers (2024-08-27T18:08:00Z) - What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance [23.411806572667707]
Text-to-image synthesis (TIS) models heavily rely on the quality and specificity of textual prompts.
Existing solutions relieve this via automatic model-preferred prompt generation from user queries.
We propose DialPrompt, a multi-turn dialogue-based TIS prompt generation model that emphasizes user-centricity.
arXiv Detail & Related papers (2024-08-23T08:35:35Z) - Retrieval Augmentation via User Interest Clustering [57.63883506013693]
Industrial recommender systems are sensitive to the patterns of user-item engagement.
We propose a novel approach that efficiently constructs user interest representations and facilitates inference at low computational cost.
Our approach has been deployed in multiple products at Meta, facilitating short-form video related recommendation.
arXiv Detail & Related papers (2024-08-07T16:35:10Z) - JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation [49.997839600988875]
Existing personalization methods rely on finetuning a text-to-image foundation model on a user's custom dataset.
We propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model.
Our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
arXiv Detail & Related papers (2024-07-08T17:59:02Z) - Prompt Refinement with Image Pivot for Text-to-Image Generation [103.63292948223592]
We introduce Prompt Refinement with Image Pivot (PRIP) for text-to-image generation.
PRIP decomposes the refinement process into two data-rich tasks: inferring representations of user-preferred images from user languages and translating image representations into system languages.
It substantially outperforms a wide range of baselines and effectively transfers to unseen systems in a zero-shot manner.
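A bare-bones sketch of the two-stage decomposition described above (all names are hypothetical placeholders, not PRIP's actual interface):

```python
# Hypothetical pipeline: user language -> representation of the preferred image
# -> system-facing prompt, instead of mapping user text to system text directly.
from typing import Any, Callable

def prip_style_refine(user_query: str,
                      infer_image_repr: Callable[[str], Any],
                      repr_to_system_prompt: Callable[[Any], str]) -> str:
    image_repr = infer_image_repr(user_query)   # stage 1: infer user-preferred image representation
    return repr_to_system_prompt(image_repr)    # stage 2: translate it into the system's prompt language
```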
arXiv Detail & Related papers (2024-06-28T22:19:24Z) - Batch-Instructed Gradient for Prompt Evolution: Systematic Prompt Optimization for Enhanced Text-to-Image Synthesis [3.783530340696776]
This study proposes a Multi-Agent framework to optimize input prompts for text-to-image generation models.
A database of professional prompts serves as a benchmark to guide the instruction modifier towards generating high-caliber prompts.
Preliminary ablation studies highlight the effectiveness of various system components and suggest areas for future improvements.
arXiv Detail & Related papers (2024-06-13T00:33:29Z) - Empowering Visual Creativity: A Vision-Language Assistant to Image Editing Recommendations [109.65267337037842]
We introduce the task of Image Editing Recommendation (IER).
IER aims to automatically generate diverse creative editing instructions from an input image and a simple prompt representing the users' under-specified editing purpose.
We introduce Creativity-Vision Language Assistant (Creativity-VLA), a multimodal framework designed specifically for edit-instruction generation.
arXiv Detail & Related papers (2024-05-31T18:22:29Z) - User-Friendly Customized Generation with Multi-Modal Prompts [21.873076466803145]
We propose a novel integration of text and images tailored to each customization concept.
Our proposed paradigm for customized text-to-image generation surpasses existing finetune-based methods in user-friendliness.
arXiv Detail & Related papers (2024-05-26T09:34:16Z) - Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond [87.1712108247199]
Our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP).
We develop a generic and personalized generative framework that can handle a wide range of personalized needs.
Our methodology enhances the capabilities of foundational language models for personalized tasks.
arXiv Detail & Related papers (2024-03-15T20:21:31Z) - Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z) - APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z) - Clarity ChatGPT: An Interactive and Adaptive Processing System for Image Restoration and Enhancement [97.41630939425731]
We propose a transformative system that combines the conversational intelligence of ChatGPT with multiple IRE methods.
Our case studies demonstrate that Clarity ChatGPT effectively improves generalization and interaction capabilities in IRE.
arXiv Detail & Related papers (2023-11-20T11:51:13Z) - Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting [13.252755478909899]
We propose a novel approach that involves rewriting user prompts based on a newly collected large-scale text-to-image dataset with over 300k prompts from 3115 users.
Our rewriting model enhances the expressiveness and alignment of user prompts with their intended visual outputs.
arXiv Detail & Related papers (2023-10-12T08:36:25Z) - CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.