StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation
- URL: http://arxiv.org/abs/2405.01434v1
- Date: Thu, 2 May 2024 16:25:16 GMT
- Title: StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation
- Authors: Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, Qibin Hou
- Abstract summary: We propose a new way of self-attention calculation, termed Consistent Self-Attention.
To extend our method to long-range video generation, we introduce a novel semantic space temporal motion prediction module.
By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos.
- Score: 117.13475564834458
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new way of self-attention calculation, termed Consistent Self-Attention, that significantly boosts the consistency between the generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic spaces. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are significantly more stable than the modules based on latent spaces only, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of contents. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation with the presentation of images and videos, which we hope could inspire more research from the aspect of architectural modifications. Our code is made publicly available at https://github.com/HVision-NKU/StoryDiffusion.
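As a rough illustration of the core idea (not the authors' implementation, which is available at the repository above), the sketch below approximates consistent self-attention by augmenting each image's self-attention keys and values with tokens sampled from the other images in the same batch, so that subjects stay correlated across the generated set. The function name, tensor shapes, and sampling ratio are illustrative assumptions.

```python
# A minimal sketch of the consistent self-attention idea, assuming
# (batch, tokens, dim) query/key/value projections taken from a pretrained
# text-to-image U-Net self-attention layer. The sampling ratio and shapes
# are illustrative assumptions, not the paper's exact configuration.
import torch


def consistent_self_attention(q, k, v, sample_ratio=0.5):
    b, n, d = k.shape
    if b < 2:  # nothing to share across images; fall back to vanilla attention
        attn = torch.softmax(q @ k.transpose(1, 2) / d**0.5, dim=-1)
        return attn @ v

    n_sample = int(n * sample_ratio)
    shared_k, shared_v = [], []
    for i in range(b):
        # Randomly sample tokens from every *other* image in the "story" batch.
        others = [j for j in range(b) if j != i]
        idx = torch.randint(0, n, (len(others), n_sample))
        shared_k.append(torch.cat([k[j, idx[m]] for m, j in enumerate(others)]))
        shared_v.append(torch.cat([v[j, idx[m]] for m, j in enumerate(others)]))
    shared_k = torch.stack(shared_k)  # (b, (b - 1) * n_sample, d)
    shared_v = torch.stack(shared_v)

    # Queries are unchanged; keys/values are each image's own tokens plus the
    # sampled cross-image tokens, which ties subject appearance across images
    # without any additional training.
    k_aug = torch.cat([k, shared_k], dim=1)
    v_aug = torch.cat([v, shared_v], dim=1)
    attn = torch.softmax(q @ k_aug.transpose(1, 2) / d**0.5, dim=-1)
    return attn @ v_aug
```

The Semantic Motion Predictor, by contrast, is a trained component that estimates transition frames between two keyframes in a semantic embedding space rather than in the image latent space, so it is not reproduced in this training-free sketch.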
Related papers
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [62.72540590546812]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
- LatentMan: Generating Consistent Animated Characters using Image Diffusion Models [44.18315132571804]
We propose a zero-shot approach for generating consistent videos of animated characters based on Text-to-Image (T2I) diffusion models.
Our proposed approach outperforms existing zero-shot T2V approaches in generating videos of animated characters in terms of pixel-wise consistency and user preference.
arXiv Detail & Related papers (2023-12-12T10:07:37Z)
- MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method that produces a video depicting multiple events from multiple individual sentences provided by the user.
Our method does not require a large-scale video dataset, since it uses a pre-trained text-to-video generative model without fine-tuning.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z)
- MoVideo: Motion-Aware Video Generation with Diffusion Models [97.03352319694795]
We propose a novel motion-aware generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow.
MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality.
arXiv Detail & Related papers (2023-11-19T13:36:03Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
- Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising [43.35391175319815]
This study explores the potential of extending the text-driven ability to the generation and editing of multi-text conditioned long videos.
We introduce a novel paradigm dubbed Gen-L-Video, capable of extending off-the-shelf short video diffusion models.
Our experimental outcomes reveal that our approach significantly broadens the generative and editing capabilities of video diffusion models.
arXiv Detail & Related papers (2023-05-29T17:38:18Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling temporal relations for composing videos of arbitrary length, from a few frames to effectively infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)