Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models
- URL: http://arxiv.org/abs/2306.00973v3
- Date: Mon, 4 Mar 2024 10:53:18 GMT
- Title: Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models
- Authors: Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yanfeng Wang, Weidi Xie
- Abstract summary: We focus on a novel yet challenging task: generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module.
We show that StoryGen can generalize to unseen characters without any optimization and generate image sequences with coherent content and consistent characters.
- Score: 70.86603627188519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models have recently exhibited exceptional capabilities
in text-to-image generation, but still struggle to generate image sequences
coherently. In this work, we focus on a novel yet challenging task: generating
a coherent image sequence based on a given storyline, denoted as open-ended
visual storytelling. We make the following three contributions: (i) to fulfill
the task of visual storytelling, we propose a learning-based auto-regressive
image generation model, termed StoryGen, with a novel vision-language context
module that generates the current frame by conditioning on the corresponding
text prompt and the preceding image-caption pairs; (ii) to address the data
shortage in visual storytelling, we collect paired image-text sequences from
online videos and open-source E-books, establishing a processing pipeline for
constructing a large-scale dataset with diverse characters, storylines, and
artistic styles, named StorySalon; (iii) quantitative experiments and human
evaluations validate the superiority of StoryGen, which we show can generalize
to unseen characters without any optimization and generate image sequences
with coherent content and consistent characters. Code, dataset, and models are
available at https://haoningwu3639.github.io/StoryGen_Webpage/
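As a concrete illustration of the auto-regressive conditioning described in the abstract, the sketch below shows how such a generation loop might be driven. `StoryGenPipeline` and its `context` argument are hypothetical stand-ins, not the released API; the vision-language context module itself is abstracted behind the pipeline call.

```python
# Minimal sketch of an auto-regressive visual storytelling loop.
# `pipeline` is a hypothetical stand-in for the StoryGen model; the
# vision-language context module is abstracted behind its call.
from typing import Any, List, Tuple

def generate_story(pipeline: Any, storyline: List[str]) -> List[Any]:
    """Generate one frame per caption, conditioning each frame on the
    current text prompt and all preceding image-caption pairs."""
    context: List[Tuple[Any, str]] = []  # preceding (image, caption) pairs
    frames: List[Any] = []
    for caption in storyline:
        # The context module fuses the prior pairs with the current
        # prompt, keeping characters and style consistent across frames.
        image = pipeline(prompt=caption, context=context)
        frames.append(image)
        context.append((image, caption))  # grow the conditioning context
    return frames
```

Each generated frame is appended to the context, which is what lets later frames stay consistent with the characters and style established earlier in the sequence.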
Related papers
- Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling [81.69474860607542]
We present Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text.
We also present Cohere-Bench, a pioneering benchmark framework for evaluating the image generation tasks when long multimodal context is provided.
arXiv Detail & Related papers (2024-08-07T11:20:37Z)
- TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to automatically generate a story for an ordered image sequence.
We propose a novel method, the Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST).
In particular, we pre-extract the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z)
- Visual Storytelling with Question-Answer Plans [70.89011289754863]
We present a novel framework which integrates visual representations with pretrained language models and planning.
Our model translates the image sequence into a visual prefix, a sequence of continuous embeddings which language models can interpret (a minimal sketch of this projection appears after this list).
It also leverages a sequence of question-answer pairs as a blueprint plan for selecting salient visual concepts and determining how they should be assembled into a narrative.
arXiv Detail & Related papers (2023-10-08T21:45:34Z)
- Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models [43.32978092618245]
We present a novel neural pipeline for generating a coherent storybook from the plain text of a story.
We leverage a combination of a pre-trained Large Language Model and a text-guided Latent Diffusion Model to generate coherent images.
arXiv Detail & Related papers (2023-02-08T06:24:06Z)
- Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences [67.61940880927708]
Current work on image-based story generation suffers from the fact that the existing image sequence collections do not have coherent plots behind them.
We improve visual story generation by producing a new image-grounded dataset, Visual Writing Prompts (VWP).
VWP contains almost 2K selected sequences of movie shots, each including 5-10 images.
The image sequences are aligned with a total of 12K stories collected via crowdsourcing, where annotators were given the image sequences and a set of grounded characters from the corresponding sequence.
arXiv Detail & Related papers (2023-01-20T13:38:24Z)
- StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop a model, StoryDALL-E, for story continuation, where the generated visual story is conditioned on a source image.
We show that our retrofitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, namely CLIP image representations and the scaling of language models, do not consistently improve multimodal self-rationalization for tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
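The sketch below illustrates the visual-prefix idea referenced in the Visual Storytelling with Question-Answer Plans entry above: image features are linearly projected into a language model's embedding space and prepended to the token embeddings. The dimensions, the single linear projection, and the class name are illustrative assumptions in the style of prefix-based captioning models, not that paper's exact architecture.

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Project image features into a language model's embedding space so
    they can be prepended to token embeddings as a 'visual prefix'.
    All sizes below are illustrative assumptions."""

    def __init__(self, image_dim: int = 512, lm_dim: int = 768,
                 prefix_len: int = 10):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        # One linear map producing `prefix_len` LM-sized vectors per image.
        self.proj = nn.Linear(image_dim, lm_dim * prefix_len)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_images, image_dim), e.g. CLIP features.
        b, n, _ = image_feats.shape
        prefix = self.proj(image_feats)  # (b, n, lm_dim * prefix_len)
        return prefix.view(b, n * self.prefix_len, self.lm_dim)
```

For a 5-image sequence with these sizes, `VisualPrefix()(torch.randn(2, 5, 512))` yields a (2, 50, 768) prefix that is concatenated with the word embeddings before the language model's forward pass; because the model attends to it like ordinary token embeddings, no architectural change is required.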