Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty
- URL: http://arxiv.org/abs/2412.06771v2
- Date: Wed, 16 Jul 2025 14:08:22 GMT
- Title: Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty
- Authors: Meera Hahn, Wenjun Zeng, Nithish Kannen, Rich Galt, Kartikeya Badola, Been Kim, Zi Wang
- Abstract summary: We propose a prototype for proactive T2I agents equipped with an interface to actively ask clarification questions when uncertain. We build simple prototypes for such agents and propose a new scalable and automated evaluation approach. Experiments over three image-text datasets demonstrate the proposed T2I agents' ability to ask informative questions and elicit crucial information.
- Score: 45.075328946207826
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: User prompts for generative AI models are often underspecified, leading to a misalignment between the user intent and models' understanding. As a result, users commonly have to painstakingly refine their prompts. We study this alignment problem in text-to-image (T2I) generation and propose a prototype for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their uncertainty about user intent as an understandable and editable belief graph. We build simple prototypes for such agents and propose a new scalable and automated evaluation approach using two agents, one with a ground truth intent (an image) while the other tries to ask as few questions as possible to align with the ground truth. We experiment over three image-text datasets: ImageInWords (Garg et al., 2024), COCO (Lin et al., 2014) and DesignBench, a benchmark we curated with strong artistic and design elements. Experiments over the three datasets demonstrate the proposed T2I agents' ability to ask informative questions and elicit crucial information to achieve successful alignment with at least 2 times higher VQAScore (Lin et al., 2024) than the standard T2I generation. Moreover, we conducted human studies and observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow, highlighting the effectiveness of our approach. Code and DesignBench can be found at https://github.com/google-deepmind/proactive_t2i_agents.
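The automated evaluation described in the abstract can be pictured as a loop between a questioner agent (the proactive T2I agent) and an answerer agent that holds the ground-truth image. Below is a minimal Python sketch of that loop under those assumptions; every callable is a placeholder to be backed by real models, and none of the names are taken from the released code.

```python
from typing import Callable, List, Optional, Tuple

Dialogue = List[Tuple[str, str]]  # (speaker, utterance) pairs


def evaluate_proactive_agent(
    prompt: str,
    ground_truth_image: object,
    ask_clarification: Callable[[Dialogue], Optional[str]],
    answer_from_image: Callable[[str, object], str],
    generate_image: Callable[[Dialogue], object],
    score_alignment: Callable[[object, object], float],
    max_turns: int = 5,
) -> Tuple[float, int]:
    """Run the questioner/answerer loop and score the final image.

    The questioner asks clarification questions while it is uncertain; the
    answerer replies by consulting the ground-truth image that stands in for
    the user's true intent. Alignment can be scored with a text/image metric
    such as VQAScore (Lin et al., 2024).
    """
    dialogue: Dialogue = [("user", prompt)]
    questions_asked = 0
    for _ in range(max_turns):
        question = ask_clarification(dialogue)
        if question is None:  # agent is confident enough to generate
            break
        answer = answer_from_image(question, ground_truth_image)
        dialogue += [("agent", question), ("user", answer)]
        questions_asked += 1

    image = generate_image(dialogue)
    return score_alignment(image, ground_truth_image), questions_asked


if __name__ == "__main__":
    # Trivial stubs so the sketch runs end to end; real agents would be
    # LLM- and T2I-backed.
    score, n = evaluate_proactive_agent(
        prompt="a bird on a branch",
        ground_truth_image="<ground-truth image>",
        ask_clarification=lambda d: "What species of bird?" if len(d) < 3 else None,
        answer_from_image=lambda q, img: "a blue jay",
        generate_image=lambda d: "<generated image>",
        score_alignment=lambda img_a, img_b: 0.0,
    )
    print(f"alignment score={score}, questions asked={n}")
```

The fixed question budget (max_turns) reflects the setup in the abstract, where the questioner tries to reach alignment with as few questions as possible.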
Related papers
- TIIF-Bench: How Does Your T2I Model Follow Your Instructions? [7.13169573900556]
We present TIIF-Bench (Text-to-Image Instruction Following Benchmark), aiming to systematically assess T2I models' ability to interpret and follow intricate textual instructions. TIIF-Bench comprises a set of 5000 prompts organized along multiple dimensions, categorized into three levels of difficulty and complexity. Two critical attributes, i.e., text rendering and style control, are introduced to evaluate the precision of text synthesis and the aesthetic coherence of T2I models.
arXiv Detail & Related papers (2025-06-02T18:44:07Z) - IA-T2I: Internet-Augmented Text-to-Image Generation [13.765327654914199]
Current text-to-image (T2I) generation models achieve promising results, but they fail in scenarios where the knowledge implied in the text prompt is uncertain. We propose an Internet-Augmented text-to-image generation (IA-T2I) framework that makes T2I models certain about such uncertain knowledge by providing them with reference images.
arXiv Detail & Related papers (2025-05-21T17:31:49Z) - Replace in Translation: Boost Concept Alignment in Counterfactual Text-to-Image [53.09546752700792]
We propose a strategy to instruct this replacement process, called Explicit Logical Narrative Prompt (ELNP). We design a metric that calculates how many of the required concepts in the prompt are, on average, covered in the synthesized images. Extensive experiments and qualitative comparisons demonstrate that our strategy can boost concept alignment in counterfactual T2I.
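A rough Python illustration of the coverage metric mentioned in the entry above: average, over the synthesized images, the fraction of required prompt concepts detected in each image. The concept detector is left abstract here (sets of concept names), so this is only one reading of the metric, not the paper's implementation.

```python
from typing import Iterable, Set


def average_concept_coverage(
    required_concepts: Set[str],
    detected_per_image: Iterable[Set[str]],
) -> float:
    """Average fraction of required prompt concepts found in each image.

    `detected_per_image` would come from a detector or VQA model applied to
    each synthesized image; here it is simply given as sets of concept names.
    """
    coverages = [
        len(required_concepts & detected) / len(required_concepts)
        for detected in detected_per_image
    ]
    return sum(coverages) / len(coverages) if coverages else 0.0


# Example: two images generated for a counterfactual prompt such as
# "a fish flying over a snowy city"; only the first covers every concept.
print(average_concept_coverage(
    {"fish", "flying", "snow", "city"},
    [{"fish", "flying", "snow", "city"}, {"fish", "city"}],
))  # -> 0.75
```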
arXiv Detail & Related papers (2025-05-20T13:27:52Z) - Evaluating Hallucination in Text-to-Image Diffusion Models with Scene-Graph based Question-Answering Agent [9.748808189341526]
An effective Text-to-Image (T2I) evaluation metric should detect instances where the generated images do not align with the textual prompts.
We propose a method based on large language models (LLMs) that conducts question answering over an extracted scene graph, and we create a dataset with human-rated scores for generated images.
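One plausible reading of the scene-graph question-answering evaluation described above, sketched in Python: turn each object, attribute, and relation of an extracted scene graph into a yes/no question, put the questions to a VQA-style answerer, and report the fraction of facts the image fails to satisfy. The graph format and the answerer are assumptions for illustration, not the paper's interface.

```python
from typing import Callable, Dict, List, Tuple

# A tiny scene-graph representation: objects, attributes, and relations.
SceneGraph = Dict[str, List[Tuple[str, ...]]]


def hallucination_score(
    scene_graph: SceneGraph,
    image: object,
    answer_yes_no: Callable[[str, object], bool],
) -> float:
    """Fraction of scene-graph facts the image fails to satisfy.

    Each object/attribute/relation is turned into a yes/no question and put to
    a VQA-style answerer; `answer_yes_no` is a placeholder for such a model.
    """
    questions: List[str] = []
    for (obj,) in scene_graph.get("objects", []):
        questions.append(f"Is there a {obj} in the image?")
    for obj, attr in scene_graph.get("attributes", []):
        questions.append(f"Is the {obj} {attr}?")
    for subj, rel, obj in scene_graph.get("relations", []):
        questions.append(f"Is the {subj} {rel} the {obj}?")

    failures = sum(not answer_yes_no(q, image) for q in questions)
    return failures / len(questions) if questions else 0.0


# Stub answerer that "sees" everything except the red color.
graph: SceneGraph = {
    "objects": [("car",), ("tree",)],
    "attributes": [("car", "red")],
    "relations": [("car", "next to", "tree")],
}
print(hallucination_score(graph, "<image>", lambda q, img: "red" not in q))  # -> 0.25
```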
arXiv Detail & Related papers (2024-12-07T18:44:38Z) - ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting [18.002126814513417]
ChatGen-Evo is a multi-stage evolution strategy that progressively equips models with essential automation skills.
ChatGen-Evo significantly enhances performance over various baselines.
arXiv Detail & Related papers (2024-11-26T07:31:12Z) - Text-to-Image Synthesis: A Decade Survey [7.250878248686215]
Text-to-image synthesis (T2I) focuses on generating high-quality images from textual descriptions.
In this survey, we review over 440 recent works on T2I.
arXiv Detail & Related papers (2024-11-25T07:40:32Z) - Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping [55.98643055756135]
We introduce Sketch2Code, a benchmark that evaluates state-of-the-art Vision Language Models (VLMs) on automating the conversion of rudimentary sketches into webpage prototypes.
We analyze ten commercial and open-source models, showing that Sketch2Code is challenging for existing VLMs.
A user study with UI/UX experts reveals a significant preference for proactive question-asking over passive feedback reception.
arXiv Detail & Related papers (2024-10-21T17:39:49Z) - Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? [97.0899853256201]
We present a novel task and benchmark for evaluating the ability of text-to-image generation models to produce images that align with commonsense in real life.
We evaluate whether T2I models can conduct visual-commonsense reasoning, e.g., produce images that fit "the lightbulb is unlit" vs. "the lightbulb is lit".
We benchmark a variety of state-of-the-art (sota) T2I models and surprisingly find that there is still a large gap between image synthesis and real-life photos.
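The pairwise setup in the entry above suggests a simple accuracy computation: for each contrastive prompt pair, an image counts as correct only when it matches the prompt it was generated from more strongly than the paired prompt. A hedged sketch follows, with the generator and matching score left as placeholders rather than the benchmark's actual evaluator.

```python
from typing import Callable, List, Tuple


def pairwise_commonsense_accuracy(
    prompt_pairs: List[Tuple[str, str]],
    generate: Callable[[str], object],
    match_score: Callable[[object, str], float],
) -> float:
    """An image is correct only when it matches its own prompt more strongly
    than the contrastive paired prompt."""
    correct = 0
    total = 0
    for prompt_a, prompt_b in prompt_pairs:
        for own, other in ((prompt_a, prompt_b), (prompt_b, prompt_a)):
            image = generate(own)
            correct += match_score(image, own) > match_score(image, other)
            total += 1
    return correct / total if total else 0.0


# Stub example with the lightbulb pair quoted above; the "image" is just the
# prompt string and the scorer is exact match, so accuracy is 1.0.
pairs = [("the lightbulb is unlit", "the lightbulb is lit")]
print(pairwise_commonsense_accuracy(
    pairs,
    generate=lambda p: p,
    match_score=lambda img, p: float(img == p),
))  # -> 1.0
```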
arXiv Detail & Related papers (2024-06-11T17:59:48Z) - DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model [90.71963723884944]
Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research.
We introduce DiffAgent, an agent designed to screen T2I APIs and make an accurate selection in seconds via API calls.
Our evaluations reveal that DiffAgent not only excels in identifying the appropriate T2I API but also underscores the effectiveness of the SFTA training framework.
arXiv Detail & Related papers (2024-03-31T06:28:15Z) - SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z) - Position: Towards Implicit Prompt For Text-To-Image Models [57.00716011456852]
This paper examines the current state of text-to-image (T2I) models with respect to implicit prompts.
We present a benchmark named ImplicitBench and conduct an investigation on the performance and impacts of implicit prompts.
Experiment results show that T2I models are able to accurately create various target symbols indicated by implicit prompts.
arXiv Detail & Related papers (2024-03-04T15:21:51Z) - Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation [19.06501699814924]
We build the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing implicitly adversarial prompts.
The challenge is run in consecutive rounds to enable a sustained discovery and analysis of safety pitfalls in T2I models.
We find that 14% of images that humans consider harmful are mislabeled as "safe" by machines.
arXiv Detail & Related papers (2024-02-14T22:21:12Z) - Automated Testing for Text-to-Image Software [0.0]
ACTesting is an automated cross-modal testing method for text-to-image (T2I) software.
We show that ACTesting can generate error-revealing tests, reducing the text-image consistency by up to 20% compared with the baseline.
The results demonstrate that ACTesting can identify abnormal behaviors of T2I software effectively.
arXiv Detail & Related papers (2023-12-20T11:19:23Z) - Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation [115.63085345822175]
We introduce "Idea to Image", a system that enables multimodal iterative self-refinement with GPT-4V(ision) for automatic image design and generation.
We investigate if systems based on large multimodal models (LMMs) can develop analogous multimodal self-refinement abilities.
arXiv Detail & Related papers (2023-10-12T17:34:20Z) - INSCIT: Information-Seeking Conversations with Mixed-Initiative Interactions [47.90088587508672]
InSCIt is a dataset for Information-Seeking Conversations with mixed-initiative Interactions.
It contains 4.7K user-agent turns from 805 human-human conversations.
We report results of two systems based on state-of-the-art models of conversational knowledge identification and open-domain question answering.
arXiv Detail & Related papers (2022-07-02T06:18:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.