Mini-DALLE3: Interactive Text to Image by Prompting Large Language
Models
- URL: http://arxiv.org/abs/2310.07653v2
- Date: Thu, 12 Oct 2023 00:54:56 GMT
- Title: Mini-DALLE3: Interactive Text to Image by Prompting Large Language
Models
- Authors: Zeqiang Lai, Xizhou Zhu, Jifeng Dai, Yu Qiao, Wenhai Wang
- Abstract summary: A prevalent limitation persists in communicating effectively with T2I models, such as Stable Diffusion, through natural language descriptions.
Inspired by the recently released DALLE3, we revisit existing T2I systems that endeavor to align with human intent and introduce a new task: interactive text to image (iT2I).
We present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models.
- Score: 71.49054220807983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The revolution in AI content generation has been rapidly accelerated by the booming text-to-image (T2I) diffusion models. Within just two years of development, state-of-the-art models have reached unprecedented levels of quality, diversity, and creativity. However, a prevalent limitation persists: communicating effectively with these popular T2I models, such as Stable Diffusion, through natural language descriptions. Without expertise in prompt engineering, with its complex word compositions, magic tags, and annotations, an engaging image is typically hard to obtain. Inspired by the recently released DALLE3, a T2I model built directly into ChatGPT that speaks human language, we revisit existing T2I systems that endeavor to align with human intent and introduce a new task, interactive text to image (iT2I), in which people converse with an LLM in natural language for interleaved high-quality image generation, editing, and refinement, together with question answering, under stronger image-text correspondence. To address the iT2I problem, we present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models. We evaluate our approach for iT2I in a variety of commonly used scenarios under different LLMs, e.g., ChatGPT, LLaMA, Baichuan, and InternLM. We demonstrate that our approach is a convenient and low-cost way to add iT2I ability to any existing LLM and any text-to-image model without any training, while causing little degradation of the LLM's inherent capabilities in, e.g., question answering and code generation. We hope this work draws broader attention and provides inspiration for improving the user experience of human-machine interaction, alongside image quality, in next-generation T2I systems.
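The "prompting plus off-the-shelf T2I models" recipe described in the abstract can be pictured as a thin routing layer around any chat LLM: a system prompt instructs the model to emit image prompts inside special tags, and a parser forwards those prompts to an unchanged T2I model. The sketch below is only a minimal illustration of that idea; the tag format, the system prompt wording, and the it2i_turn helper are hypothetical conventions, not the paper's exact protocol.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical instruction asking the LLM to wrap any image it wants to show
# in <image>...</image> tags; the actual prompt used by Mini-DALLE3 may differ.
SYSTEM_PROMPT = (
    "You are a helpful assistant that can show images. Whenever an image would "
    "help, write a detailed text-to-image prompt wrapped in <image>...</image> "
    "tags, then continue answering in plain text."
)

IMAGE_TAG = re.compile(r"<image>(.*?)</image>", re.DOTALL)


def it2i_turn(
    user_message: str,
    llm: Callable[[str], str],     # any chat LLM: full prompt -> reply text
    t2i: Callable[[str], object],  # any off-the-shelf T2I model: prompt -> image
) -> Tuple[str, List[object]]:
    """One interactive turn: the LLM answers in text and, wherever it chose to,
    embeds image prompts that are routed to the T2I model. No training needed."""
    reply = llm(f"{SYSTEM_PROMPT}\n\nUser: {user_message}\nAssistant:")
    # Generate one image per tagged prompt the LLM emitted.
    images = [t2i(p.strip()) for p in IMAGE_TAG.findall(reply)]
    # Replace the raw tags with placeholders so the visible text reads naturally.
    text = IMAGE_TAG.sub("[image]", reply)
    return text, images
```

Plugging in, say, a ChatGPT or LLaMA chat endpoint as llm and a Stable Diffusion pipeline as t2i would yield interleaved text and images within a single turn; multi-turn editing or refinement would additionally carry the conversation history and previously emitted image prompts back into the next llm call.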
Related papers
- MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis [18.876109299162138]
We introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE).
This innovative component integrates pre-trained LLMs by independently processing linguistic and visual information, freezing the textual component while fine-tuning the visual component.
MARS requires only 9% of the GPU days needed by SD1.5, yet it achieves remarkable results across a variety of benchmarks.
arXiv Detail & Related papers (2024-07-10T12:52:49Z)
- ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis [6.066100464517522]
We introduce the Abstractive News Captions with High-level cOntext Representation (ANCHOR) dataset, containing 70K+ samples sourced from 5 different news media organizations.
Our proposed method Subject-Aware Finetuning (SAFE), selects and enhances the representation of key subjects in synthesized images by leveraging LLM-generated subject weights.
It also adapts to the domain distribution of news images and captions through custom Domain Fine-tuning, outperforming current T2I baselines on ANCHOR.
arXiv Detail & Related papers (2024-04-15T21:19:10Z)
- Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation [150.57983348059528]
PRISM is an algorithm that automatically identifies human-interpretable and transferable prompts.
It can effectively generate desired concepts given only black-box access to T2I models.
Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles and images.
arXiv Detail & Related papers (2024-03-28T02:35:53Z)
- SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z)
- DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation [37.25815760042241]
This paper introduces DirecT2V, a new framework for text-to-video (T2V) generation.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
arXiv Detail & Related papers (2023-05-23T17:57:09Z)
- GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation [143.81719619351335]
Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions.
The tight coupling between the text encoder and the image decoder in current T2I models makes either component challenging to replace or upgrade.
We propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model.
arXiv Detail & Related papers (2023-03-17T15:37:07Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V) generation.
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
- TIME: Text and Image Mutual-Translation Adversarial Networks [55.1298552773457]
We propose Text and Image Mutual-Translation Adversarial Networks (TIME).
TIME learns a T2I generator G and an image captioning discriminator D under the Generative Adversarial Network framework.
In experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB and MS-COCO datasets.
arXiv Detail & Related papers (2020-05-27T06:40:12Z)