ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context
- URL: http://arxiv.org/abs/2407.09774v2
- Date: Wed, 21 Aug 2024 14:17:31 GMT
- Title: ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context
- Authors: Sixiao Zheng, Yanwei Fu
- Abstract summary: Existing autoregressive methods struggle with high memory usage, slow generation speeds, and limited context integration.
We propose ContextualStory, a novel framework designed to generate coherent story frames and extend frames for story continuation.
In experiments on PororoSV and FlintstonesSV benchmarks, ContextualStory significantly outperforms existing methods in both story visualization and story continuation.
- Score: 50.572907418430155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual storytelling involves generating a sequence of coherent frames from a textual storyline while maintaining consistency in characters and scenes. Existing autoregressive methods, which rely on previous frame-sentence pairs, struggle with high memory usage, slow generation speeds, and limited context integration. To address these issues, we propose ContextualStory, a novel framework designed to generate coherent story frames and extend frames for story continuation. ContextualStory utilizes Spatially-Enhanced Temporal Attention to capture spatial and temporal dependencies, handling significant character movements effectively. Additionally, we introduce a Storyline Contextualizer to enrich the context in the storyline embedding and a StoryFlow Adapter to measure scene changes between frames to guide the model. Extensive experiments on the PororoSV and FlintstonesSV benchmarks demonstrate that ContextualStory significantly outperforms existing methods in both story visualization and story continuation.
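The abstract only names the Spatially-Enhanced Temporal Attention mechanism, so here is a minimal PyTorch sketch of one plausible reading: temporal self-attention run independently at each spatial location (so character movement is tracked across frames), preceded by a lightweight spatial mixing step. The class name, tensor layout, and the depthwise convolution used for spatial enhancement are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatiallyEnhancedTemporalAttention(nn.Module):
    """Illustrative sketch only: temporal self-attention applied per
    spatial location, with a depthwise conv mixing local spatial context
    into each frame's features before attending across frames."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        # Hypothetical spatial enhancement step (an assumption, not
        # necessarily what the paper does).
        self.spatial_mix = nn.Conv2d(channels, channels, kernel_size=3,
                                     padding=1, groups=channels)
        self.attn = nn.MultiheadAttention(channels, num_heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        y = self.spatial_mix(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Fold spatial positions into the batch so attention runs over time.
        y = y.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        y = self.norm(y)
        out, _ = self.attn(y, y, y)
        out = out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        return x + out  # residual connection

# Example: a batch of two 5-frame stories with 128-channel 16x16 latents.
if __name__ == "__main__":
    layer = SpatiallyEnhancedTemporalAttention(128)
    frames = torch.randn(2, 5, 128, 16, 16)
    print(layer(frames).shape)  # torch.Size([2, 5, 128, 16, 16])
```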
Related papers
- Story-Adapter: A Training-free Iterative Framework for Long Story Visualization [14.303607837426126]
We propose a training-free and computationally efficient framework, termed Story-Adapter, to enhance generative capability for long stories.
Central to our framework is a training-free global reference cross-attention module, which aggregates all generated images from the previous iteration (a hypothetical sketch of such a module follows this entry).
Experiments validate the superiority of Story-Adapter in improving both semantic consistency and generative capability for fine-grained interactions.
arXiv Detail & Related papers (2024-10-08T17:59:30Z) - Generating Visual Stories with Grounded and Coreferent Characters [63.07511918366848]
- Generating Visual Stories with Grounded and Coreferent Characters [63.07511918366848]
We present the first model capable of predicting visual stories with consistently grounded and coreferent character mentions.
Our model is finetuned on a new dataset which we build on top of the widely used VIST benchmark.
We also propose new evaluation metrics to measure the richness of characters and coreference in stories.
arXiv Detail & Related papers (2024-09-20T14:56:33Z) - Make-A-Storyboard: A General Framework for Storyboard with Disentangled
and Merged Control [131.1446077627191]
We propose a new presentation form for Story Visualization called Storyboard, inspired by film-making.
Within each scene in Storyboard, characters engage in activities at the same location, necessitating both visually consistent scenes and characters.
Our method can be seamlessly integrated into mainstream Image Customization methods, empowering them with the capability of story visualization.
arXiv Detail & Related papers (2023-12-06T12:16:23Z) - Causal-Story: Local Causal Attention Utilizing Parameter-Efficient
Tuning For Visual Story Synthesis [12.766712398098646]
We propose Causal-Story, which considers the causal relationship between previous captions, frames, and current captions (a hypothetical sketch of such a local causal attention mask follows this entry).
We evaluate our model on the PororoSV and FlintstonesSV datasets and obtain state-of-the-art FID scores.
arXiv Detail & Related papers (2023-09-18T08:06:06Z) - Text-Only Training for Visual Storytelling [107.19873669536523]
- Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
arXiv Detail & Related papers (2023-08-17T09:32:17Z) - Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with online text augmentation.
The proposed method significantly outperforms the state of the art on various metrics, including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z) - Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context (a toy sketch of this idea follows this entry).
Our experiments for story generation on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms the prior state of the art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
arXiv Detail & Related papers (2022-11-23T21:38:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.