Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty
- URL: http://arxiv.org/abs/2412.06771v1
- Date: Mon, 09 Dec 2024 18:56:32 GMT
- Title: Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty
- Authors: Meera Hahn, Wenjun Zeng, Nithish Kannen, Rich Galt, Kartikeya Badola, Been Kim, Zi Wang
- Abstract summary: We propose a design for proactive T2I agents equipped with an interface to actively ask clarification questions when uncertain.
We build simple prototypes for such agents and verify their effectiveness through both human studies and automated evaluation.
We observed that these T2I agents were able to ask informative questions and elicit crucial information to achieve successful alignment with at least 2 times higher VQAScore than the standard single-turn T2I generation.
- Score: 45.075328946207826
- License:
- Abstract: User prompts for generative AI models are often underspecified, leading to sub-optimal responses. This problem is particularly evident in text-to-image (T2I) generation, where users commonly struggle to articulate their precise intent. This disconnect between the user's vision and the model's interpretation often forces users to painstakingly and repeatedly refine their prompts. To address this, we propose a design for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their understanding of user intent as an understandable belief graph that a user can edit. We build simple prototypes for such agents and verify their effectiveness through both human studies and automated evaluation. We observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow. Moreover, we develop a scalable automated evaluation approach using two agents: one has access to a ground truth image, and the other tries to ask as few questions as possible to align with that ground truth. On DesignBench, a benchmark we created for artists and designers, the COCO dataset (Lin et al., 2014), and ImageInWords (Garg et al., 2024), we observed that these T2I agents were able to ask informative questions and elicit crucial information to achieve successful alignment with at least 2 times higher VQAScore (Lin et al., 2024) than the standard single-turn T2I generation. Demo: https://github.com/google-deepmind/proactive_t2i_agents.
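The abstract describes the clarify-then-generate loop and the two-agent evaluation only at a high level. The sketch below is a minimal, illustrative Python outline of how such a loop could be structured; it is not taken from the paper or its demo repository. The function names (`estimate_uncertainty`, `generate_question`, `generate_image`, `answer_question`), the uncertainty threshold, and the dictionary standing in for the belief graph are all assumptions made for illustration.

```python
# Illustrative sketch of a proactive T2I agent loop (not the authors' implementation).
# Model calls are abstracted as callables so the control flow stays self-contained.

from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ProactiveT2IAgent:
    """Asks clarification questions while its uncertainty about the prompt is high."""
    estimate_uncertainty: Callable[[str, Dict[str, str]], float]  # hypothetical: scores prompt ambiguity
    generate_question: Callable[[str, Dict[str, str]], str]       # hypothetical: proposes next clarification question
    generate_image: Callable[[str, Dict[str, str]], object]       # hypothetical: T2I backend
    uncertainty_threshold: float = 0.3                             # assumed value, not from the paper
    max_questions: int = 5
    beliefs: Dict[str, str] = field(default_factory=dict)          # stands in for the editable belief graph

    def run(self, prompt: str, answer_question: Callable[[str], str]) -> object:
        """Clarify-then-generate loop; `answer_question` is the user or a proxy agent."""
        for _ in range(self.max_questions):
            if self.estimate_uncertainty(prompt, self.beliefs) < self.uncertainty_threshold:
                break  # confident enough about the intent; stop asking
            question = self.generate_question(prompt, self.beliefs)
            self.beliefs[question] = answer_question(question)     # record the answer as a belief
        return self.generate_image(prompt, self.beliefs)
```

In the automated evaluation described in the abstract, `answer_question` would be played by a second agent that has access to the ground-truth image and answers from it, and the resulting image would then be scored against that ground truth with a metric such as VQAScore.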
Related papers
- Evaluating Hallucination in Text-to-Image Diffusion Models with Scene-Graph based Question-Answering Agent [9.748808189341526]
An effective Text-to-Image (T2I) evaluation metric should detect instances where the generated images do not align with the textual prompts.
We propose a method based on large language models (LLMs) that conducts question-answering over an extracted scene graph, and we create a dataset with human-rated scores for generated images.
arXiv Detail & Related papers (2024-12-07T18:44:38Z) - ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting [18.002126814513417]
ChatGen-Evo is a multi-stage evolution strategy that progressively equips models with essential automation skills.
ChatGen-Evo significantly enhances performance over various baselines.
arXiv Detail & Related papers (2024-11-26T07:31:12Z) - Text-to-Image Synthesis: A Decade Survey [7.250878248686215]
Text-to-image synthesis (T2I) focuses on generating high-quality images from textual descriptions.
In this survey, we review over 440 recent works on T2I.
arXiv Detail & Related papers (2024-11-25T07:40:32Z) - Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping [55.98643055756135]
We introduce Sketch2Code, a benchmark that evaluates state-of-the-art Vision Language Models (VLMs) on automating the conversion of rudimentary sketches into webpage prototypes.
We analyze ten commercial and open-source models, showing that Sketch2Code is challenging for existing VLMs.
A user study with UI/UX experts reveals a significant preference for proactive question-asking over passive feedback reception.
arXiv Detail & Related papers (2024-10-21T17:39:49Z) - DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model [90.71963723884944]
Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research.
We introduce DiffAgent, an agent designed to select the appropriate T2I API in seconds via API calls.
Our evaluations reveal that DiffAgent not only excels in identifying the appropriate T2I API but also underscores the effectiveness of the SFTA training framework.
arXiv Detail & Related papers (2024-03-31T06:28:15Z) - SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z) - Position: Towards Implicit Prompt For Text-To-Image Models [57.00716011456852]
This paper examines the current state of text-to-image (T2I) models with respect to implicit prompts.
We present a benchmark named ImplicitBench and investigate the performance and impacts of implicit prompts.
Experiment results show that T2I models are able to accurately create various target symbols indicated by implicit prompts.
arXiv Detail & Related papers (2024-03-04T15:21:51Z) - Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation [19.06501699814924]
We build the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing implicitly adversarial prompts.
The challenge is run in consecutive rounds to enable a sustained discovery and analysis of safety pitfalls in T2I models.
We find that 14% of images that humans consider harmful are mislabeled as "safe" by machines.
arXiv Detail & Related papers (2024-02-14T22:21:12Z) - ACTesting: Automated Cross-modal Testing Method of Text-to-Image Software [9.351572210564134]
ACTesting is an automated cross-modal testing method for Text-to-Image software.
In our experiments, ACTesting effectively generates error-revealing tests, resulting in a decrease in text-image consistency by up to 20%.
Results validate that ACTesting can reliably identify errors within T2I software.
arXiv Detail & Related papers (2023-12-20T11:19:23Z) - Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation [115.63085345822175]
We introduce "Idea to Image", a system that enables multimodal iterative self-refinement with GPT-4V(ision) for automatic image design and generation.
We investigate if systems based on large multimodal models (LMMs) can develop analogous multimodal self-refinement abilities.
arXiv Detail & Related papers (2023-10-12T17:34:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.