AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort
- URL: http://arxiv.org/abs/2311.11243v1
- Date: Sun, 19 Nov 2023 06:07:37 GMT
- Title: AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort
- Authors: Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, Chunhua Shen
- Abstract summary: We propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images.
We utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images.
- Score: 55.83007338095763
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Story visualization aims to generate a series of images that match the story described in text, and it requires the generated images to be of high quality, aligned with the text description, and consistent in character identity. Given the complexity of story visualization, existing methods drastically simplify the problem by considering only a few specific characters and scenarios, or by requiring users to provide per-image control conditions such as sketches. However, these simplifications make such methods unsuitable for real applications. To this end, we propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images with minimal human interaction. Specifically,
we utilize the comprehension and planning capabilities of large language models
for layout planning, and then leverage large-scale text-to-image models to
generate sophisticated story images based on the layout. We empirically find
that sparse control conditions, such as bounding boxes, are suitable for layout
planning, while dense control conditions, e.g., sketches and keypoints, are
suitable for generating high-quality image content. To obtain the best of both
worlds, we devise a dense condition generation module to transform simple
bounding box layouts into sketch or keypoint control conditions for final image
generation, which not only improves the image quality but also allows easy and
intuitive user interactions. In addition, we propose a simple yet effective
method to generate multi-view consistent character images, eliminating the
reliance on human labor to collect or draw character images.
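The pipeline above reduces to three steps: LLM layout planning, dense condition generation, and conditioned image generation. The following is a minimal sketch, not the authors' code: my_llm, plan_layout, and boxes_to_control_image are hypothetical stand-ins for the LLM layout planner and the dense condition generation module, while the final stage uses the off-the-shelf diffusers ControlNet API as a generic conditioned text-to-image backend.

```python
import json

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image, ImageDraw

def my_llm(prompt: str) -> str:
    # Hypothetical LLM call; returns a canned layout so the sketch runs
    # end to end. In the paper, an LLM plans bounding boxes from the story.
    return json.dumps([
        {"name": "knight", "box": [0.10, 0.30, 0.45, 0.95]},
        {"name": "dragon", "box": [0.55, 0.20, 0.95, 0.95]},
    ])

def plan_layout(panel_text: str) -> list:
    # Step 1 (sparse condition): ask the LLM for per-object bounding boxes.
    prompt = ("List bounding boxes in [0,1] coordinates, as JSON "
              f"[{{'name': ..., 'box': [x0, y0, x1, y1]}}], for: {panel_text}")
    return json.loads(my_llm(prompt))

def boxes_to_control_image(layout: list, size: int = 512) -> Image.Image:
    # Step 2 (dense condition generation, heavily simplified): rasterize the
    # sparse box layout into a dense control image. AutoStory instead
    # synthesizes sketch or keypoint conditions inside each box.
    canvas = Image.new("RGB", (size, size), "black")
    draw = ImageDraw.Draw(canvas)
    for item in layout:
        x0, y0, x1, y1 = (c * size for c in item["box"])
        draw.rectangle([x0, y0, x1, y1], outline="white", width=3)
    return canvas

# Step 3: condition a large text-to-image model on the dense control image.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

text = "A knight greets a dragon at the castle gate, storybook illustration"
image = pipe(text, image=boxes_to_control_image(plan_layout(text))).images[0]
image.save("panel_1.png")
```

The multi-view consistent character generation mentioned in the abstract is omitted here; in this sketch, character identity would have to be carried by the prompt alone.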
Related papers
- Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Consistent text-to-image generation asks a model to portray the same subject across diverse prompts.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z) - LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis [24.925757148750684]
- LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis [24.925757148750684]
We propose a training-free approach for layout-to-image synthesis that excels at producing high-quality images aligned with both textual prompts and layout instructions.
LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods.
arXiv Detail & Related papers (2023-11-21T04:28:12Z) - LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image
Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z) - Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion
Models [70.86603627188519]
We focus on a novel yet challenging task: generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module.
We show that StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z) - GlyphDraw: Seamlessly Rendering Text with Intricate Spatial Structures
- GlyphDraw: Seamlessly Rendering Text with Intricate Spatial Structures in Text-to-Image Generation [18.396131717250793]
We introduce GlyphDraw, a general learning framework that endows image generation models with the capacity to generate images coherently embedded with text in any specific language.
Our method not only renders the characters specified in prompts accurately, but also seamlessly blends the generated text into the background.
arXiv Detail & Related papers (2023-03-31T08:06:33Z) - Unified Multi-Modal Latent Diffusion for Joint Subject and Text
Conditional Image Generation [63.061871048769596]
We present Unified Multi-Modal Latent Diffusion (UMM-Diffusion), which takes texts and images containing specified subjects as joint input sequences.
More specifically, both the input texts and images are encoded into one unified multi-modal latent space.
Our method generates high-quality images whose complex semantics draw on both the input texts and the input images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z) - Plug-and-Play Diffusion Features for Text-Driven Image-to-Image
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z) - TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images into the latent space of a well-trained StyleGAN.
A visual-linguistic similarity module learns text-image matching by mapping the image and text into a common embedding space.
Instance-level optimization is used for identity preservation during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)