TemporalStory: Enhancing Consistency in Story Visualization using Spatial-Temporal Attention
- URL: http://arxiv.org/abs/2407.09774v1
- Date: Sat, 13 Jul 2024 05:02:42 GMT
- Title: TemporalStory: Enhancing Consistency in Story Visualization using Spatial-Temporal Attention
- Authors: Sixiao Zheng, Yanwei Fu
- Abstract summary: We introduce TemporalStory, a text-to-image story visualization method that uses spatial-temporal attention to model spatial and temporal dependencies across story images.
We also introduce a text adapter capable of integrating information from other sentences into the embedding of the current sentence.
Our TemporalStory outperforms the previous state-of-the-art in both story visualization and story continuation tasks.
- Score: 50.572907418430155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Story visualization presents a challenging task in text-to-image generation, requiring not only the rendering of visual details from the text prompt but also consistency across images. Recently, most approaches address the inconsistency problem in an auto-regressive manner, conditioning on previous image-sentence pairs. However, they overlook the fact that story context is dispersed across all sentences. The auto-regressive approach cannot encode information from subsequent image-sentence pairs and is thus unable to capture the entirety of the story context. To address this, we introduce TemporalStory, leveraging spatial-temporal attention to model complex spatial and temporal dependencies in images, enabling the generation of coherent images based on a given storyline. To better understand the storyline context, we introduce a text adapter capable of integrating information from other sentences into the embedding of the current sentence. Additionally, to use scene changes between story images as guidance for the model, we propose the StoryFlow Adapter to measure the degree of change between images. Through extensive experiments on two popular benchmarks, PororoSV and FlintstonesSV, our TemporalStory outperforms the previous state-of-the-art in both story visualization and story continuation tasks.
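For intuition, here is a minimal PyTorch sketch of the spatial-temporal attention pattern the abstract describes: spatial attention mixes tokens within each frame, while temporal attention lets each spatial position attend across all story frames, past and future, unlike an auto-regressive scheme. The class name, tensor layout (B, T, N, D), and use of standard multi-head attention are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Illustrative interleaved spatial and temporal self-attention.

    x has shape (B, T, N, D): B stories, T frames per story,
    N spatial tokens per frame, D channels.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, d = x.shape

        # Spatial attention: fold frames into the batch and attend
        # over the N tokens inside each frame.
        s = self.norm1(x).reshape(b * t, n, d)
        s, _ = self.spatial_attn(s, s, s)
        x = x + s.reshape(b, t, n, d)

        # Temporal attention: fold spatial positions into the batch and
        # attend over all T frames, so every frame sees the whole story
        # context rather than only its predecessors.
        p = self.norm2(x).permute(0, 2, 1, 3).reshape(b * n, t, d)
        p, _ = self.temporal_attn(p, p, p)
        x = x + p.reshape(b, n, t, d).permute(0, 2, 1, 3)
        return x

# Example: two 5-frame stories with 64 tokens per frame.
block = SpatialTemporalBlock(dim=320)
out = block(torch.randn(2, 5, 64, 320))  # -> (2, 5, 64, 320)
```

Because the temporal attention is bidirectional over the frame axis, information from later sentences can influence earlier images, which is exactly the property the abstract contrasts with auto-regressive conditioning.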
Related papers
- Story-Adapter: A Training-free Iterative Framework for Long Story Visualization [14.303607837426126]
We propose a training-free and computationally efficient framework, termed Story-Adapter, to enhance generative capability for long story visualization.
Central to our framework is a training-free global reference cross-attention module, which aggregates all images generated in the previous iteration (a minimal sketch of this mechanism appears after this list).
Experiments validate the superiority of Story-Adapter in improving both semantic consistency and generative capability for fine-grained interactions.
arXiv Detail & Related papers (2024-10-08T17:59:30Z)
- Generating Visual Stories with Grounded and Coreferent Characters [63.07511918366848]
We present the first model capable of predicting visual stories with consistently grounded and coreferent character mentions.
Our model is finetuned on a new dataset which we build on top of the widely used VIST benchmark.
We also propose new evaluation metrics to measure the richness of characters and coreference in stories.
arXiv Detail & Related papers (2024-09-20T14:56:33Z)
- Make-A-Storyboard: A General Framework for Storyboard with Disentangled and Merged Control [131.1446077627191]
We propose a new presentation form for Story Visualization called Storyboard, inspired by film-making.
Within each scene in Storyboard, characters engage in activities at the same location, necessitating both visually consistent scenes and characters.
Our method could be seamlessly integrated into mainstream Image Customization methods, empowering them with the capability of story visualization.
arXiv Detail & Related papers (2023-12-06T12:16:23Z)
- Causal-Story: Local Causal Attention Utilizing Parameter-Efficient Tuning For Visual Story Synthesis [12.766712398098646]
We propose Causal-Story, which considers the causal relationship between previous captions, frames, and current captions.
We evaluate our model on the PororoSV and FlintstonesSV datasets and obtain state-of-the-art FID scores.
arXiv Detail & Related papers (2023-09-18T08:06:06Z)
- Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
arXiv Detail & Related papers (2023-08-17T09:32:17Z)
- Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation.
The proposed method significantly outperforms the state of the art on various metrics, including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z)
- Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context.
Our experiments on story generation with the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms the prior state-of-the-art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
arXiv Detail & Related papers (2022-11-23T21:38:51Z)
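To make the Story-Adapter entry above more concrete, here is a rough sketch of a global reference cross-attention step, in which the frame currently being refined attends to tokens pooled from all images generated in the previous iteration. The function name and the use of identity Q/K/V projections are assumptions for illustration; a training-free module would presumably reuse the pretrained diffusion model's attention projections rather than introduce new weights.

```python
import torch
import torch.nn.functional as F

def global_reference_cross_attention(
    frame_tokens: torch.Tensor,             # (N, D) tokens of the frame being refined
    reference_tokens: list[torch.Tensor],   # one (Ni, D) tensor per image from the previous iteration
) -> torch.Tensor:
    """Let one frame attend to a bank of tokens aggregated from *all*
    previously generated images, injecting global story context."""
    kv = torch.cat(reference_tokens, dim=0)     # (sum Ni, D) global reference bank
    out = F.scaled_dot_product_attention(
        frame_tokens.unsqueeze(0),              # query (1, N, D)
        kv.unsqueeze(0),                        # key   (1, sum Ni, D)
        kv.unsqueeze(0),                        # value (1, sum Ni, D)
    )
    return frame_tokens + out.squeeze(0)        # residual update, shape (N, D)
```

In an iterative scheme like Story-Adapter's, repeating such a pass over every frame is what would let consistency propagate across the whole story without any additional training.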
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information shown and is not responsible for any consequences arising from its use.