Story-Adapter: A Training-free Iterative Framework for Long Story Visualization

URL: http://arxiv.org/abs/2410.06244v1
Date: Tue, 8 Oct 2024 17:59:30 GMT
Title: Story-Adapter: A Training-free Iterative Framework for Long Story Visualization
Authors: Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, Yuyin Zhou,
Abstract summary: We propose a training-free and computationally efficient framework, termed Story-Adapter, to enhance the generative capability of long stories. Central to our framework is a training-free global reference cross-attention module, which aggregates all generated images from the previous iteration. Experiments validate the superiority of Story-Adapter in improving both semantic consistency and generative capability for fine-grained interactions.
Score: 14.303607837426126
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Story visualization, the task of generating coherent images based on a narrative, has seen significant advancements with the emergence of text-to-image models, particularly diffusion models. However, maintaining semantic consistency, generating high-quality fine-grained interactions, and ensuring computational feasibility remain challenging, especially in long story visualization (i.e., up to 100 frames). In this work, we propose a training-free and computationally efficient framework, termed Story-Adapter, to enhance the generative capability of long stories. Specifically, we propose an iterative paradigm to refine each generated image, leveraging both the text prompt and all generated images from the previous iteration. Central to our framework is a training-free global reference cross-attention module, which aggregates all generated images from the previous iteration to preserve semantic consistency across the entire story, while minimizing computational costs with global embeddings. This iterative process progressively optimizes image generation by repeatedly incorporating text constraints, resulting in more precise and fine-grained interactions. Extensive experiments validate the superiority of Story-Adapter in improving both semantic consistency and generative capability for fine-grained interactions, particularly in long story scenarios. The project page and associated code can be accessed via https://jwmao1.github.io/storyadapter.

Related papers

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling [81.69474860607542]
We present Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text. We also present Cohere-Bench, a pioneering benchmark framework for evaluating the image generation tasks when long multimodal context is provided.
arXiv Detail & Related papers (2024-08-07T11:20:37Z)
StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion [78.1014542102578]
Story visualization aims to generate realistic and coherent images based on a storyline. Current models adopt a frame-by-frame architecture, adapting a pre-trained text-to-image model to generate frames auto-regressively. We propose a bidirectional, unified, and efficient framework, namely StoryImager.
arXiv Detail & Related papers (2024-04-09T03:22:36Z)
Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on a novel yet challenging task: generating a coherent image sequence from a given storyline, denoted as open-ended visual storytelling. We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module. We show that StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z)
Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models [43.32978092618245]
We present a novel neural pipeline for generating a coherent storybook from the plain text of a story. We leverage a combination of a pre-trained Large Language Model and a text-guided Latent Diffusion Model to generate coherent images.
arXiv Detail & Related papers (2023-02-08T06:24:06Z)
Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context. Our experiments for story generation on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
arXiv Detail & Related papers (2022-11-23T21:38:51Z)
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop a model StoryDALL-E for story continuation, where the generated visual story is conditioned on a source image. We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image. Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z)
Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions. Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task. We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.