TDRI: Two-Phase Dialogue Refinement and Co-Adaptation for Interactive Image Generation
- URL: http://arxiv.org/abs/2503.17669v1
- Date: Sat, 22 Mar 2025 06:40:21 GMT
- Title: TDRI: Two-Phase Dialogue Refinement and Co-Adaptation for Interactive Image Generation
- Authors: Yuheng Feng, Jianhui Wang, Kun Li, Sida Li, Tianyu Shi, Haoyue Han, Miao Zhang, Xueqian Wang
- Abstract summary: TDRI (Two-Phase Dialogue Refinement and Co-Adaptation) addresses ambiguous prompts and misalignment with user intent by enhancing image generation through iterative user interaction. It consists of two phases: the Initial Generation Phase, which creates base images based on user prompts, and the Interactive Refinement Phase, which integrates user feedback through three key modules. TDRI exhibits strong potential for a wide range of applications in creative and industrial domains.
- Score: 19.229851510402952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although text-to-image generation technologies have made significant advancements, they still face challenges when dealing with ambiguous prompts and aligning outputs with user intent. Our proposed framework, TDRI (Two-Phase Dialogue Refinement and Co-Adaptation), addresses these issues by enhancing image generation through iterative user interaction. It consists of two phases: the Initial Generation Phase, which creates base images based on user prompts, and the Interactive Refinement Phase, which integrates user feedback through three key modules. The Dialogue-to-Prompt (D2P) module ensures that user feedback is effectively transformed into actionable prompts, which improves the alignment between user intent and model input. By evaluating generated outputs against user expectations, the Feedback-Reflection (FR) module identifies discrepancies and facilitates improvements. To ensure consistently high-quality results, the Adaptive Optimization (AO) module fine-tunes the generation process by balancing user preferences and maintaining prompt fidelity. Experimental results show that TDRI outperforms existing methods, achieving a 33.6% human preference rate, compared to 6.2% for GPT-4 augmentation, and the highest CLIP and BLIP alignment scores (0.338 and 0.336, respectively). In iterative feedback tasks, user satisfaction increased to 88% after 8 rounds, with diminishing returns beyond 6 rounds. Furthermore, TDRI has been found to reduce the number of iterations and improve personalization in the creation of fashion products. TDRI exhibits strong potential for a wide range of applications in the creative and industrial domains, as it streamlines the creative process and improves alignment with user preferences.
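As a rough illustration of the loop the abstract describes, the sketch below wires the two phases and three modules together in Python. Every name here (generator, d2p, fr, ao, Feedback, get_feedback) is a hypothetical stand-in rather than the authors' API; the default of eight rounds simply mirrors the reported saturation of satisfaction around 6-8 feedback rounds.

```python
# Hypothetical sketch of the TDRI loop described in the abstract.
# All module interfaces are illustrative assumptions, not the paper's code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Feedback:
    text: str          # free-form user critique of the current image
    satisfied: bool    # whether the user accepts the current image

def tdri_session(user_prompt: str, generator, d2p, fr, ao,
                 get_feedback: Callable[[object], Feedback],
                 max_rounds: int = 8):
    # Phase 1: Initial Generation, a base image from the raw prompt.
    prompt = user_prompt
    image = generator.generate(prompt)
    # Phase 2: Interactive Refinement, iterate until the user is satisfied.
    for _ in range(max_rounds):
        feedback = get_feedback(image)
        if feedback.satisfied:
            break
        # D2P: transform dialogue feedback into an actionable prompt.
        prompt = d2p.rewrite(prompt, feedback.text)
        # FR: compare the output with expectations and flag discrepancies.
        issues = fr.reflect(image, feedback.text)
        # AO: adjust generation, balancing preference and prompt fidelity.
        ao.update(generator, prompt, issues)
        image = generator.generate(prompt)
    return image
```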
Related papers
- Search-Based Interaction For Conversation Recommendation via Generative Reward Model Based Simulated User [117.82681846559909]
Conversational recommendation systems (CRSs) use multi-turn interaction to capture user preferences and provide personalized recommendations.
We propose a generative reward model based simulated user, named GRSU, for automatic interaction with CRSs.
arXiv Detail & Related papers (2025-04-29T06:37:30Z)
- Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Preference Understanding [29.191627597682597]
We present a framework incorporating human-in-the-loop feedback, leveraging a well-trained reward model aligned with user preferences.
Our approach consistently surpasses competing models in user satisfaction, especially in multi-turn dialogue scenarios.
arXiv Detail & Related papers (2025-04-25T09:35:02Z)
- Reasoning LLMs for User-Aware Multimodal Conversational Agents [3.533721662684487]
Personalization in social robotics is critical for fostering effective human-robot interactions.
This paper proposes a novel framework called USER-LLM R1 for a user-aware conversational agent.
Our approach integrates chain-of-thought (CoT) reasoning and vision-language models to iteratively infer user preferences.
arXiv Detail & Related papers (2025-04-02T13:00:17Z)
- OMR-Diffusion: Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Intent Understanding [21.101906599201314]
We present a Visual Co-Adaptation framework that incorporates human-in-the-loop feedback. The framework applies multiple reward functions (such as diversity, consistency, and preference feedback) to refine the diffusion model; a minimal sketch of such a multi-reward combination follows this entry. Experiments show the model achieves 508 wins in human evaluation, outperforming DALL-E 3 (463 wins) and others.
arXiv Detail & Related papers (2025-03-22T06:10:57Z)
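The multi-reward refinement in the entry above can be pictured as a weighted combination of per-criterion scores used to rank or fine-tune candidates. A minimal sketch, assuming illustrative reward functions and weights that are not taken from the paper:

```python
# Illustrative combination of several reward signals (diversity,
# consistency, preference) into one scalar, as the OMR-Diffusion summary
# describes. The reward functions and weights are assumptions.
from typing import Callable, Dict

def combined_reward(image, prompt,
                    rewards: Dict[str, Callable],
                    weights: Dict[str, float]) -> float:
    """Weighted sum of per-criterion rewards for one candidate image."""
    return sum(weights[name] * fn(image, prompt)
               for name, fn in rewards.items())

# Usage sketch: rank diffusion-model candidates by combined reward.
# rewards = {"diversity": div_fn, "consistency": cons_fn, "preference": pref_fn}
# best = max(candidates,
#            key=lambda img: combined_reward(img, prompt, rewards, weights))
```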
- Enhancing Intent Understanding for Ambiguous prompt: A Human-Machine Co-Adaption Strategy [20.954269395301885]
We propose a human-machine co-adaption strategy using mutual information between the user's prompts and the pictures under modification. We find that an improved model can reduce the necessity for multiple rounds of adjustments.
arXiv Detail & Related papers (2025-01-25T10:32:00Z)
- What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance [23.411806572667707]
Text-to-image synthesis (TIS) models heavily rely on the quality and specificity of textual prompts.
Existing solutions relieve this via automatic model-preferred prompt generation from user queries.
We propose DialPrompt, a multi-turn dialogue-based TIS prompt generation model that emphasises user-centricity.
arXiv Detail & Related papers (2024-08-23T08:35:35Z)
- ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios like AI portrait and advertising.
We present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z)
- CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems [60.27663010453209]
We leverage large language models (LLMs) to generate satisfaction-aware counterfactual dialogues.
We gather human annotations to ensure the reliability of the generated samples.
Our results shed light on the need for data augmentation approaches for user satisfaction estimation in TOD systems.
arXiv Detail & Related papers (2024-03-27T23:45:31Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- Feature Decoupling-Recycling Network for Fast Interactive Segmentation [79.22497777645806]
Recent interactive segmentation methods iteratively take the source image, user guidance, and previously predicted mask as input.
We propose the Feature Decoupling-Recycling Network (FDRN), which decouples the modeling components based on their intrinsic discrepancies.
arXiv Detail & Related papers (2023-08-07T12:26:34Z)
- Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations [67.92679668612858]
We propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals.
Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings; and (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions.
On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements. Notably, it achieves significant recall gains, with a 2.77% increase in R@10 and a 6.67% boost in R@50, underscoring its effectiveness; a rough sketch of the consensus loss follows this entry.
arXiv Detail & Related papers (2023-06-03T11:50:44Z)
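The consensus idea in the entry above, several compositors whose output distributions are pulled toward agreement by a KL term, might look roughly like this; the exact loss form and shapes are assumptions based on the summary, not the paper's implementation:

```python
# Rough sketch of a KL consensus term over several compositor outputs,
# as the Css-Net summary describes. The loss form is an assumption.
import torch
import torch.nn.functional as F

def consensus_kl_loss(logits_list):
    """Average pairwise KL divergence among compositor output distributions."""
    log_probs = [F.log_softmax(l, dim=-1) for l in logits_list]
    probs = [lp.exp() for lp in log_probs]
    loss, pairs = 0.0, 0
    for i in range(len(logits_list)):
        for j in range(len(logits_list)):
            if i != j:
                # KL(p_i || p_j), encouraging compositors to agree.
                loss = loss + F.kl_div(log_probs[j], probs[i],
                                       reduction="batchmean")
                pairs += 1
    return loss / pairs

# Usage sketch with four compositors producing logits for the same query:
# loss = consensus_kl_loss([c(x) for c in compositors])
```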
- Joint Turn and Dialogue level User Satisfaction Estimation on Multi-Domain Conversations [6.129731338249762]
Current automated methods to estimate turn and dialogue level user satisfaction employ hand-crafted features.
We propose a novel user satisfaction estimation approach which minimizes an adaptive multi-task loss function.
The BiLSTM-based deep neural network model automatically weighs each turn's contribution towards the estimated dialogue-level rating; a rough sketch of this turn-weighting idea follows this entry.
arXiv Detail & Related papers (2020-10-06T05:53:13Z)
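The turn-weighting idea in the entry above can be sketched as attention over BiLSTM turn states feeding both a turn-level and a dialogue-level head; the dimensions, heads, and loss pairing here are assumptions based on the summary, not the paper's model:

```python
# Hypothetical sketch: attention weights over turns aggregate BiLSTM
# states into a dialogue-level satisfaction rating.
import torch
import torch.nn as nn

class TurnWeightedSatisfaction(nn.Module):
    def __init__(self, turn_dim=128, hidden=64, n_classes=5):
        super().__init__()
        self.bilstm = nn.LSTM(turn_dim, hidden, bidirectional=True,
                              batch_first=True)
        self.turn_head = nn.Linear(2 * hidden, n_classes)   # per-turn rating
        self.attn = nn.Linear(2 * hidden, 1)                # turn importance
        self.dialogue_head = nn.Linear(2 * hidden, n_classes)

    def forward(self, turns):                # turns: (batch, n_turns, turn_dim)
        states, _ = self.bilstm(turns)       # (batch, n_turns, 2*hidden)
        turn_logits = self.turn_head(states)
        # Softmax attention weighs each turn's contribution to the dialogue.
        weights = torch.softmax(self.attn(states), dim=1)
        pooled = (weights * states).sum(dim=1)
        return turn_logits, self.dialogue_head(pooled)

# A multi-task loss would combine turn-level and dialogue-level
# cross-entropy terms, e.g. with adaptive weighting as the summary suggests.
```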
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.