Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models
- URL: http://arxiv.org/abs/2302.03900v1
- Date: Wed, 8 Feb 2023 06:24:06 GMT
- Title: Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models
- Authors: Hyeonho Jeong, Gihyun Kwon, Jong Chul Ye
- Abstract summary: We present a novel neural pipeline for generating a coherent storybook from the plain text of a story.
We leverage a combination of a pre-trained Large Language Model and a text-guided Latent Diffusion Model to generate coherent images.
- Score: 43.32978092618245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in large-scale text-to-image models have opened new
possibilities for guiding the creation of images through human-devised natural
language. However, while prior literature has primarily focused on the
generation of individual images, it is essential to consider the capability of
these models to ensure coherency within a sequence of images to fulfill the
demands of real-world applications such as storytelling. To address this, here
we present a novel neural pipeline for generating a coherent storybook from the
plain text of a story. Specifically, we leverage a combination of a pre-trained
Large Language Model and a text-guided Latent Diffusion Model to generate
coherent images. While previous story synthesis frameworks typically require a
large-scale text-to-image model trained on expensive image-caption pairs to
maintain coherency, we employ simple textual inversion techniques along with
detector-based semantic image editing, allowing zero-shot generation of a
coherent storybook. Experimental results show that our proposed method
outperforms state-of-the-art image editing baselines.
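As a rough illustration of the generation stage described in the abstract, the sketch below assumes per-scene prompts already produced by a large language model from the plain story text and a textual-inversion embedding already learned for the main character. The model name, embedding path, and the "<hero>" pseudo-token are illustrative assumptions, the detector-based semantic editing step is omitted, and this is not the authors' implementation.

```python
# Minimal sketch (assumptions noted above): render each LLM-derived scene prompt
# with a latent diffusion model, using a learned textual-inversion token so the
# same pseudo-token "<hero>" denotes the protagonist in every image.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a textual-inversion embedding previously learned for the protagonist
# (hypothetical local path).
pipe.load_textual_inversion("./protagonist_embedding", token="<hero>")

# Per-scene prompts, e.g. obtained by asking an LLM to summarize each scene
# of the plain text story (illustrative examples).
scene_prompts = [
    "<hero> walking through a snowy forest at dusk, storybook illustration",
    "<hero> discovering a glowing lantern inside a cave, storybook illustration",
]

generator = torch.Generator("cuda").manual_seed(0)  # fixed seed for reproducibility
for i, prompt in enumerate(scene_prompts):
    image = pipe(prompt, num_inference_steps=50, generator=generator).images[0]
    image.save(f"scene_{i}.png")
```

Reusing the same learned token and a fixed random seed across prompts is one simple way to keep the character's appearance consistent from scene to scene without any additional training of the diffusion model.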
Related papers
- Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models struggle to consistently portray the same subject across diverse prompts.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
- Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on a novel, yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed as StoryGen, with a novel vision-language context module.
We show StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach makes text-to-image diffusion models easier to use and improves the user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop a model StoryDALL-E for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.