StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story
Continuation
- URL: http://arxiv.org/abs/2209.06192v1
- Date: Tue, 13 Sep 2022 17:47:39 GMT
- Title: StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story
Continuation
- Authors: Adyasha Maharana, Darryl Hannan, and Mohit Bansal
- Abstract summary: We develop a model, StoryDALL-E, for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
- Score: 76.44802273236081
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in text-to-image synthesis have led to large pretrained
transformers with excellent capabilities to generate visualizations from a
given text. However, these models are ill-suited for specialized tasks like
story visualization, which requires an agent to produce a sequence of images
given a corresponding sequence of captions, forming a narrative. Moreover, we
find that the story visualization task fails to accommodate generalization to
unseen plots and characters in new narratives. Hence, we first propose the task
of story continuation, where the generated visual story is conditioned on a
source image, allowing for better generalization to narratives with new
characters. Then, we enhance or 'retro-fit' the pretrained text-to-image
synthesis models with task-specific modules for (a) sequential image generation
and (b) copying relevant elements from an initial frame. Next, we explore both
full-model finetuning and parameter-efficient prompt-based tuning of the
pretrained model. We evaluate our approach, StoryDALL-E, on
two existing datasets, PororoSV and FlintstonesSV, and introduce a new dataset
DiDeMoSV, collected from a video-captioning dataset. We also develop a model,
StoryGANc, based on Generative Adversarial Networks (GANs), for story
continuation, and compare it with the StoryDALL-E model to demonstrate the
advantages of our approach. We show that our retro-fitting approach outperforms
GAN-based models for story continuation and facilitates copying of visual
elements from the source image, thereby improving continuity in the generated
visual story. Finally, our analysis suggests that pretrained transformers
struggle to comprehend narratives containing several characters. Overall, our
work demonstrates that pretrained text-to-image synthesis models can be adapted
for complex and low-resource tasks like story continuation.
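To make the adaptation recipe concrete, here is a minimal PyTorch sketch of the two retro-fitted ingredients the abstract describes: learnable prompt embeddings prepended to the caption encoding for parameter-efficient tuning, and a new trainable cross-attention pathway that lets the frozen generator copy from the source frame. All module names, dimensions, and the `pretrained` backbone interface are illustrative assumptions for exposition, not the authors' released code.

```python
# Hedged sketch of the StoryDALL-E-style retro-fitting idea; shapes and the
# backbone interface are assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn

class RetroCrossAttention(nn.Module):
    """Trainable cross-attention over source-frame tokens, bolted onto the
    activations of a frozen pretrained text-to-image transformer."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden, source_tokens):
        # hidden:        (B, T, d) activations from the frozen backbone
        # source_tokens: (B, S, d) embeddings of the initial-frame image tokens
        copied, _ = self.attn(hidden, source_tokens, source_tokens)
        return self.norm(hidden + copied)  # residual "copy" pathway

class PromptTunedStoryModel(nn.Module):
    """Parameter-efficient adaptation: only the prompt embeddings and the
    retro-fitted cross-attention module receive gradients."""
    def __init__(self, pretrained, d_model=1024, n_heads=16, n_prompt=32):
        super().__init__()
        self.backbone = pretrained            # hypothetical frozen transformer
        for p in self.backbone.parameters():
            p.requires_grad = False           # keep pretrained weights frozen
        self.prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)
        self.copy_attn = RetroCrossAttention(d_model, n_heads)

    def forward(self, caption_emb, source_tokens):
        B = caption_emb.size(0)
        prompt = self.prompt.unsqueeze(0).expand(B, -1, -1)
        # Prepend learnable prompt vectors to the caption embeddings.
        hidden = self.backbone(torch.cat([prompt, caption_emb], dim=1))
        # Let the generator attend to (and copy from) the source frame.
        return self.copy_attn(hidden, source_tokens)
```

Under this reading, full-model finetuning would unfreeze the backbone as well, while prompt tuning trains only the small number of new parameters, which is what makes the approach viable for low-resource story datasets.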
Related papers
- TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST).
In particular, we pre-extract the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z)
- Let Storytelling Tell Vivid Stories: An Expressive and Fluent Multimodal Storyteller [21.953766228135827]
We propose a new pipeline, termed LLaMS, to generate multimodal human-level stories.
We first employ a sequence data auto-enhancement strategy to enhance factual content expression.
Second, we propose the SQ-Adapter module for story illustration generation, which can maintain sequence consistency.
arXiv Detail & Related papers (2024-03-12T04:07:00Z)
- Visual Storytelling with Question-Answer Plans [70.89011289754863]
We present a novel framework that integrates visual representations with pretrained language models and planning.
Our model translates the image sequence into a visual prefix, a sequence of continuous embeddings that language models can interpret (a minimal sketch of this visual-prefix idea appears after this list).
It also leverages a sequence of question-answer pairs as a blueprint plan for selecting salient visual concepts and determining how they should be assembled into a narrative.
arXiv Detail & Related papers (2023-10-08T21:45:34Z)
- Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on a novel yet challenging task: generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module.
We show StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z)
- Improved Visual Story Generation with Adaptive Context Modeling [39.04249009170821]
We present a simple method that improves the leading system with adaptive context modeling.
We evaluate our model on PororoSV and FlintstonesSV datasets and show that our approach achieves state-of-the-art FID scores on both story visualization and continuation scenarios.
arXiv Detail & Related papers (2023-05-26T10:43:42Z)
- Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context.
Our experiments on story generation with the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms the prior state-of-the-art in generating frames with high visual quality, but also models appropriate correspondences between characters and background.
arXiv Detail & Related papers (2022-11-23T21:38:51Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling [81.33107307509718]
We propose a topic-adaptive storyteller to model inter-topic generalization.
We also propose a prototype encoding structure to model intra-topic derivation.
Experimental results show that topic adaptation and prototype encoding mutually benefit the few-shot model.
arXiv Detail & Related papers (2020-08-11T03:55:11Z)
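As referenced in the Visual Storytelling with Question-Answer Plans entry above, a "visual prefix" maps image features into continuous embeddings a language model can consume. The sketch below shows the generic idea only; the MLP mapper, pooled per-image features, and all dimensions are illustrative assumptions, not that paper's implementation.

```python
# Hedged sketch of a generic visual-prefix mapper; not the cited paper's code.
import torch
import torch.nn as nn

class VisualPrefixMapper(nn.Module):
    def __init__(self, image_dim=768, lm_dim=1024, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        # Project one pooled image feature into prefix_len LM-sized embeddings.
        self.proj = nn.Sequential(
            nn.Linear(image_dim, lm_dim * prefix_len),
            nn.Tanh(),
        )

    def forward(self, image_feats):
        # image_feats: (B, N, image_dim), one pooled feature per story image
        B, N, _ = image_feats.shape
        prefix = self.proj(image_feats)                # (B, N, lm_dim * L)
        # One L-token continuous prefix per image, concatenated in story order.
        return prefix.view(B, N * self.prefix_len, self.lm_dim)
```

In such a setup, the returned prefix embeddings would be concatenated with ordinary token embeddings before the language model's transformer stack, letting a frozen or lightly tuned LM condition its narration on the image sequence.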