Towards Smooth Video Composition
- URL: http://arxiv.org/abs/2212.07413v1
- Date: Wed, 14 Dec 2022 18:54:13 GMT
- Title: Towards Smooth Video Composition
- Authors: Qihang Zhang, Ceyuan Yang, Yujun Shen, Yinghao Xu, and Bolei Zhou
- Abstract summary: Video generation requires synthesizing consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to infinitely many, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
- Score: 59.134911550142455
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video generation requires synthesizing consistent and persistent frames with
dynamic content over time. This work investigates modeling the temporal
relations for composing videos of arbitrary length, from a few frames to
infinitely many, using generative adversarial networks (GANs). First, towards
composing adjacent frames, we show that the alias-free operation for single
image generation, together with adequately pre-learned knowledge, brings a
smooth frame transition without compromising the per-frame quality. Second, by
incorporating the temporal shift module (TSM), originally designed for video
understanding, into the discriminator, we manage to advance the generator in
synthesizing more consistent dynamics. Third, we develop a novel B-Spline based
motion representation that ensures temporal smoothness and enables
infinite-length video generation, extending beyond the number of frames seen
during training. A low-rank temporal modulation is further proposed to alleviate content repetition
for long video generation. We evaluate our approach on various datasets and
show substantial improvements over video generation baselines. Code and models
will be publicly available at https://genforce.github.io/StyleSV.
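Two of the components named in the abstract, the temporal shift module (TSM) placed inside the discriminator and the B-Spline based motion representation, can be illustrated with a short sketch. The code below is only an illustration under assumed settings (a PyTorch-style (B, T, C, H, W) feature layout, SciPy's BSpline, and arbitrary shift fraction, control-point count, and latent sizes); it is not the paper's actual implementation.
```python
# Illustrative sketches only; shapes, names, and hyperparameters are assumptions.
import numpy as np
import torch
from scipy.interpolate import BSpline


def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """TSM-style channel shift along the time axis for (B, T, C, H, W) features.

    A fraction of channels is shifted forward in time, another fraction
    backward, so each frame's features mix information from its neighbours.
    """
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels untouched
    return out


def bspline_motion_codes(n_ctrl: int = 8, dim: int = 512,
                         n_frames: int = 128, degree: int = 3) -> np.ndarray:
    """Sample a smooth motion trajectory from a few random control codes.

    A clamped cubic B-Spline interpolates the control latent codes, so any
    number of frames can be sampled from a temporally smooth curve.
    """
    # Clamped knot vector: len(knots) = n_ctrl + degree + 1
    knots = np.concatenate([
        np.zeros(degree),
        np.linspace(0.0, 1.0, n_ctrl - degree + 1),
        np.ones(degree),
    ])
    ctrl = np.random.randn(n_ctrl, dim)                    # control motion codes
    spline = BSpline(knots, ctrl, degree)
    times = np.linspace(0.0, 1.0, n_frames, endpoint=False)
    return spline(times)                                   # (n_frames, dim)


feats = torch.randn(2, 16, 64, 32, 32)   # fake per-frame discriminator features
mixed = temporal_shift(feats)            # same shape, temporally mixed
motion = bspline_motion_codes()          # (128, 512) smooth motion codes
```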
Related papers
- ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models [66.84478240757038]
A majority of video diffusion models (VDMs) generate long videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frames of the previous clip.
We introduce causal (i.e., unidirectional) generation into VDMs and use past frames as a prompt to generate future frames.
Our ViD-GPT achieves state-of-the-art performance both quantitatively and qualitatively on long video generation.
arXiv Detail & Related papers (2024-06-16T15:37:22Z)
- LoopAnimate: Loopable Salient Object Animation [19.761865029125524]
LoopAnimate is a novel method for generating videos with consistent start and end frames.
It achieves state-of-the-art performance in both objective metrics, such as fidelity and temporal consistency, and subjective evaluation results.
arXiv Detail & Related papers (2024-04-14T07:36:18Z)
- Explorative Inbetweening of Time and Space [46.77750028273578]
We introduce bounded generation to control video generation based only on a given start and end frame.
Time Reversal Fusion fuses the temporally forward and backward denoising paths conditioned on the start and end frame.
We find that Time Reversal Fusion outperforms related work on all subtasks.
arXiv Detail & Related papers (2024-03-21T17:57:31Z)
- Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator [59.589919015669274]
This study focuses on zero-shot text-to-video generation with an emphasis on data and cost efficiency.
We propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantically coherent prompt sequence.
We also propose a series of annotative modifications to adapt LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation.
arXiv Detail & Related papers (2023-09-25T19:42:16Z)
- Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM.
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z)
- ControlVideo: Training-free Controllable Text-to-Video Generation [117.06302461557044]
ControlVideo is a framework to enable natural and efficient text-to-video generation.
It generates both short and long videos within several minutes using one NVIDIA 2080Ti.
arXiv Detail & Related papers (2023-05-22T14:48:53Z)
- Strumming to the Beat: Audio-Conditioned Contrastive Video Textures [112.6140796961121]
We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning.
We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order.
Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
arXiv Detail & Related papers (2021-04-06T17:24:57Z)