DiffChat: Learning to Chat with Text-to-Image Synthesis Models for
Interactive Image Creation
- URL: http://arxiv.org/abs/2403.04997v1
- Date: Fri, 8 Mar 2024 02:24:27 GMT
- Title: DiffChat: Learning to Chat with Text-to-Image Synthesis Models for
Interactive Image Creation
- Authors: Jiapeng Wang, Chengyu Wang, Tingfeng Cao, Jun Huang, Lianwen Jin
- Abstract summary: We present DiffChat, a novel method to align Large Language Models (LLMs) to "chat" with prompt-as-input Text-to-Image Synthesis (TIS) models for interactive image creation.
Given a raw prompt/image and a user-specified instruction, DiffChat can effectively make appropriate modifications and generate the target prompt.
- Score: 40.478839423995296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present DiffChat, a novel method to align Large Language Models (LLMs) to
"chat" with prompt-as-input Text-to-Image Synthesis (TIS) models (e.g., Stable
Diffusion) for interactive image creation. Given a raw prompt/image and a
user-specified instruction, DiffChat can effectively make appropriate
modifications and generate the target prompt, which can be leveraged to create
the target image of high quality. To achieve this, we first collect an
instruction-following prompt engineering dataset named InstructPE for the
supervised training of DiffChat. Next, we propose a reinforcement learning
framework with the feedback of three core criteria for image creation, i.e.,
aesthetics, user preference, and content integrity. It involves an action-space
dynamic modification technique to obtain more relevant positive samples and
harder negative samples during the off-policy sampling. Content integrity is
also introduced into the value estimation function for further improvement of
produced images. Our method outperforms baseline
models and strong competitors in both automatic and human evaluations,
which demonstrates its effectiveness.
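The three feedback criteria named in the abstract (aesthetics, user preference, and content integrity) can be illustrated with a small Python sketch. This is not DiffChat's implementation: the abstract does not say how the criteria are scored or combined, so the weighted average, the assumed [0, 1] score range, and the names CriterionScores, combined_reward, and rank_candidates are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionScores:
    """Hypothetical per-image scores, each assumed to lie in [0, 1]."""
    aesthetics: float
    user_preference: float
    content_integrity: float


def combined_reward(scores: CriterionScores,
                    weights: tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted average of the three criteria (an assumed combination rule)."""
    w_aes, w_pref, w_int = weights
    total = w_aes + w_pref + w_int
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = (w_aes * scores.aesthetics
                + w_pref * scores.user_preference
                + w_int * scores.content_integrity)
    return weighted / total


def rank_candidates(candidates: list[tuple[str, CriterionScores]],
                    weights: tuple[float, float, float] = (1.0, 1.0, 1.0)
                    ) -> list[tuple[str, CriterionScores]]:
    """Sort (prompt, scores) pairs from highest to lowest combined reward."""
    return sorted(candidates,
                  key=lambda item: combined_reward(item[1], weights),
                  reverse=True)


if __name__ == "__main__":
    # Made-up prompts and scores, purely for demonstration.
    demo = [
        ("a cat on a sofa", CriterionScores(0.4, 0.5, 0.9)),
        ("a cat on a sofa, soft light, detailed fur", CriterionScores(0.8, 0.7, 0.9)),
    ]
    for prompt, scores in rank_candidates(demo):
        print(f"{combined_reward(scores):.3f}  {prompt}")
```

An ordering like this is one conceivable way to separate stronger candidate prompts from weaker ones. The action-space dynamic modification technique and the value estimation function mentioned in the abstract are not reproduced here.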
Related papers
- FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting [18.708185548091716]
FRAP is a simple, yet effective approach based on adaptively adjusting the per-token prompt weights to improve prompt-image alignment and authenticity of the generated images.
We show FRAP generates images with significantly higher prompt-image alignment to prompts from complex datasets.
We also explore combining FRAP with prompt-rewriting LLMs to recover their degraded prompt-image alignment.
arXiv Detail & Related papers (2024-08-21T15:30:35Z)
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation [49.997839600988875]
Existing personalization methods rely on finetuning a text-to-image foundation model on a user's custom dataset.
We propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model.
Our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
arXiv Detail & Related papers (2024-07-08T17:59:02Z)
- Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models [20.19571676239579]
We introduce a novel diffusion-based framework to enhance the alignment of generated images with their corresponding descriptions.
Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image.
We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt.
arXiv Detail & Related papers (2024-06-24T06:12:16Z)
- Batch-Instructed Gradient for Prompt Evolution: Systematic Prompt Optimization for Enhanced Text-to-Image Synthesis [3.783530340696776]
This study proposes a Multi-Agent framework to optimize input prompts for text-to-image generation models.
A professional prompts database serves as a benchmark to guide the instruction modifier towards generating high-caliber prompts.
Preliminary ablation studies highlight the effectiveness of various system components and suggest areas for future improvements.
arXiv Detail & Related papers (2024-06-13T00:33:29Z)
- Dynamic Prompt Optimizing for Text-to-Image Generation [63.775458908172176]
We introduce the Prompt Auto-Editing (PAE) method to improve text-to-image generative models.
We employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, yielding dynamic fine-control prompts.
arXiv Detail & Related papers (2024-04-05T13:44:39Z)
- A User-Friendly Framework for Generating Model-Preferred Prompts in
Text-to-Image Synthesis [33.71897211776133]
Well-designed prompts have demonstrated the potential to guide text-to-image models in generating amazing images.
It is challenging for novice users to achieve the desired results by manually entering prompts.
We propose a novel framework that automatically translates user-input prompts into model-preferred prompts.
arXiv Detail & Related papers (2024-02-20T06:58:49Z)
- BeautifulPrompt: Towards Automatic Prompt Engineering for Text-to-Image
Synthesis [14.852061933308276]
We propose BeautifulPrompt, a deep generative model to produce high-quality prompts from very simple raw descriptions.
In our work, we first fine-tuned the BeautifulPrompt model on collected pairs of low-quality and high-quality prompts.
We further showcase the integration of BeautifulPrompt into a cloud-native AI platform to provide a better text-to-image generation service.
arXiv Detail & Related papers (2023-11-12T06:39:00Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual
Representation Learners [58.941838860425754]
We show that training self-supervised methods on synthetic images can match or beat the real image counterpart.
We develop a multi-positive contrastive learning method, which we call StableRep.
With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP.
arXiv Detail & Related papers (2023-06-01T17:59:51Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with
Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.