Improving Text-to-Image Consistency via Automatic Prompt Optimization
- URL: http://arxiv.org/abs/2403.17804v1
- Date: Tue, 26 Mar 2024 15:42:01 GMT
- Title: Improving Text-to-Image Consistency via Automatic Prompt Optimization
- Authors: Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, Michal Drozdzal
- Abstract summary: We introduce a T2I optimization-by-prompting framework, OPT2I, to improve prompt-image consistency in T2I models.
Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score.
- Score: 26.2587505265501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.
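The iterative loop described in the abstract (start from the user prompt, generate revised prompts, keep those with the best consistency score) can be illustrated with a minimal sketch. This is not the authors' implementation: `optimize_prompt`, `toy_propose`, and `toy_score` are hypothetical names, and the two toy functions are deterministic stand-ins for the LLM and for the T2I model plus consistency metric (for example DSG), which a real system would call instead.

```python
from collections import Counter
from typing import Callable, List, Tuple

Scored = Tuple[str, float]  # (prompt, consistency score)


def optimize_prompt(
    user_prompt: str,
    propose: Callable[[str, List[Scored]], List[str]],
    score: Callable[[str, str], float],
    iterations: int = 5,
    target: float = 1.0,
) -> Scored:
    """Keep a score-sorted history of prompts and return the best one found."""
    history: List[Scored] = [(user_prompt, score(user_prompt, user_prompt))]
    for _ in range(iterations):
        if history[0][1] >= target:  # assumed maximum score reached
            break
        # The proposer sees the scored history, best prompt first.
        for candidate in propose(user_prompt, history):
            history.append((candidate, score(user_prompt, candidate)))
        history.sort(key=lambda pair: pair[1], reverse=True)
    return history[0]


def toy_score(user_prompt: str, candidate: str) -> float:
    """Toy stand-in for 'generate images, then score consistency'.

    Assumption for illustration only: a word counts as rendered when the
    candidate prompt repeats it at least twice.
    """
    counts = Counter(candidate.split())
    words = user_prompt.split()
    return sum(counts[w] >= 2 for w in words) / len(words)


def toy_propose(user_prompt: str, history: List[Scored]) -> List[str]:
    """Toy stand-in for the LLM: emphasize one not-yet-repeated word."""
    best_prompt = history[0][0]
    counts = Counter(best_prompt.split())
    for word in user_prompt.split():
        if counts[word] < 2:
            return [f"{best_prompt} {word}"]
    return []


if __name__ == "__main__":
    print(optimize_prompt("two red cubes", toy_propose, toy_score))
```

In a real setting the scorer would be the expensive step, since each candidate prompt requires image generation before it can be scored, so the iteration budget and the number of candidates per round are the main cost controls.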
Related papers
- PromptLA: Towards Integrity Verification of Black-box Text-to-Image Diffusion Models [16.67563247104523]
Current text-to-image (T2I) diffusion models can produce high-quality images.
Malicious users who are authorized to use the model only for benign purposes might modify their models to generate images that result in harmful social impacts.
We propose a novel prompt selection algorithm for efficient and accurate integrity verification of T2I diffusion models.
arXiv Detail & Related papers (2024-12-20T07:24:32Z)
- Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG [6.701537544179892]
We introduce a novel approach to enhance the capabilities of text-to-image models by incorporating a graph-based RAG.
Our system dynamically retrieves detailed character information and relational data from the knowledge graph, enabling the generation of visually accurate and contextually rich images.
arXiv Detail & Related papers (2024-12-12T18:59:41Z)
- GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis [10.47359822447001]
We present an alternate paradigm for T2I synthesis, decomposing the task of complex multi-step generation into three steps.
Our approach derives its strength from the fact that it is modular in nature, is training free, and can be applied over any combination of image generation and editing models.
arXiv Detail & Related papers (2024-12-08T22:29:56Z)
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- TIPO: Text to Image with Text Presampling for Prompt Optimization [16.001151202788304]
TIPO is an innovative framework designed to enhance text-to-image (T2I) generation by language model (LM)
Unlike previous approaches that rely on Large Language Models (LLMs) or reinforcement learning (RL), TIPO adjusts user input prompts with the distribution of a trained prompt dataset.
arXiv Detail & Related papers (2024-11-12T19:09:45Z)
- FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting [18.708185548091716]
FRAP is a simple, yet effective approach based on adaptively adjusting the per-token prompt weights to improve prompt-image alignment and authenticity of the generated images.
We show FRAP generates images with significantly higher prompt-image alignment to prompts from complex datasets.
We also explore combining FRAP with a prompt-rewriting LLM to recover their degraded prompt-image alignment.
arXiv Detail & Related papers (2024-08-21T15:30:35Z)
- ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios like AI portrait and advertising.
We present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z)
- SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z)
- If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- TIME: Text and Image Mutual-Translation Adversarial Networks [55.1298552773457]
We propose Text and Image Mutual-Translation Adversarial Networks (TIME).
TIME learns a T2I generator G and an image captioning discriminator D under the Generative Adversarial Network framework.
In experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB and MS-COCO datasets.
arXiv Detail & Related papers (2020-05-27T06:40:12Z)