Latent-Shift: Latent Diffusion with Temporal Shift for Efficient
Text-to-Video Generation
- URL: http://arxiv.org/abs/2304.08477v2
- Date: Tue, 18 Apr 2023 03:27:52 GMT
- Title: Latent-Shift: Latent Diffusion with Temporal Shift for Efficient
Text-to-Video Generation
- Authors: Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo
Luo, Xi Yin
- Abstract summary: Latent-Shift is an efficient text-to-video generation method based on a pretrained text-to-image generation model.
We show that Latent-Shift achieves comparable or better results while being significantly more efficient.
- Score: 115.09597127418452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Latent-Shift -- an efficient text-to-video generation method based
on a pretrained text-to-image generation model that consists of an autoencoder
and a U-Net diffusion model. Learning a video diffusion model in the latent
space is much more efficient than in the pixel space. The latter is often
limited to first generating a low-resolution video followed by a sequence of
frame interpolation and super-resolution models, which makes the entire
pipeline very complex and computationally expensive. To extend a U-Net from
image generation to video generation, prior work proposes to add additional
modules like 1D temporal convolution and/or temporal attention layers. In
contrast, we propose a parameter-free temporal shift module that can leverage
the spatial U-Net as is for video generation. We achieve this by shifting two
portions of the feature map channels forward and backward along the temporal
dimension. The shifted features of the current frame thus receive the features
from the previous and the subsequent frames, enabling motion learning without
additional parameters. We show that Latent-Shift achieves comparable or better
results while being significantly more efficient. Moreover, Latent-Shift can
generate images despite being finetuned for T2V generation.
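
A minimal sketch of the parameter-free temporal shift idea described in the abstract: two slices of the feature-map channels are shifted forward and backward along the temporal dimension, so each frame's features receive information from its previous and subsequent frames without any extra parameters. The tensor layout, the shift fraction (1/8 of the channels each way), and zero-padding at the clip boundaries are illustrative assumptions, not the authors' released implementation.

```python
import torch


def temporal_shift(x: torch.Tensor, num_frames: int, shift_frac: float = 1 / 8) -> torch.Tensor:
    """Shift channel slices along the temporal axis without adding parameters.

    x: feature map of shape (batch * num_frames, channels, height, width),
       as produced by a spatial U-Net applied to each frame independently.
    """
    bt, c, h, w = x.shape
    b = bt // num_frames
    x = x.view(b, num_frames, c, h, w)

    n_shift = max(1, int(c * shift_frac))
    out = torch.zeros_like(x)

    # First channel slice: frame t receives features from frame t-1 (shift forward).
    out[:, 1:, :n_shift] = x[:, :-1, :n_shift]
    # Second channel slice: frame t receives features from frame t+1 (shift backward).
    out[:, :-1, n_shift:2 * n_shift] = x[:, 1:, n_shift:2 * n_shift]
    # Remaining channels are left untouched.
    out[:, :, 2 * n_shift:] = x[:, :, 2 * n_shift:]

    return out.view(bt, c, h, w)


# Example: 2 videos of 8 frames each, with 64-channel feature maps at 32x32.
feats = torch.randn(2 * 8, 64, 32, 32)
shifted = temporal_shift(feats, num_frames=8)
print(shifted.shape)  # torch.Size([16, 64, 32, 32])
```

Because the operation only permutes existing activations, it can be dropped into a pretrained spatial U-Net as is; the boundary frames simply receive zeros for the shifted channel slices in this sketch.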
Related papers
- ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation.
We introduce a novel, bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning.
Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z) - Decouple Content and Motion for Conditional Image-to-Video Generation [6.634105805557556]
conditional image-to-video (cI2V) generation is to create a believable new video by beginning with the condition, i.e., one image and text.
Previous cI2V generation methods conventionally perform in RGB pixel space, with limitations in modeling motion consistency and visual continuity.
We propose a novel approach by disentangling the target RGB pixels into two distinct components: spatial content and temporal motions.
arXiv Detail & Related papers (2023-11-24T06:08:27Z) - MoVideo: Motion-Aware Video Generation with Diffusion Models [97.03352319694795]
We propose a novel motion-aware generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow.
MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality.
arXiv Detail & Related papers (2023-11-19T13:36:03Z) - LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables text-to-image diffusion model Learn A specific Motion Pattern with 816 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture the features of temporal dimension, we expand the pretrained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers.
arXiv Detail & Related papers (2023-10-16T19:03:19Z) - Align your Latents: High-Resolution Video Synthesis with Latent
Diffusion Models [71.11425812806431]
Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands.
Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task.
We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling.
arXiv Detail & Related papers (2023-04-18T08:30:32Z) - Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z) - Decoupled Spatial-Temporal Transformer for Video Inpainting [77.8621673355983]
Video aims to fill the given holes with realistic appearance but is still a challenging task even with prosperous deep learning approaches.
Recent works introduce the promising Transformer architecture into deep video inpainting and achieve better performance.
We propose a Decoupled Spatial-Temporal Transformer (DSTT) for improving video inpainting with exceptional efficiency.
arXiv Detail & Related papers (2021-04-14T05:47:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.