DiffSynth: Latent In-Iteration Deflickering for Realistic Video
Synthesis
- URL: http://arxiv.org/abs/2308.03463v3
- Date: Thu, 10 Aug 2023 02:26:16 GMT
- Title: DiffSynth: Latent In-Iteration Deflickering for Realistic Video
Synthesis
- Authors: Zhongjie Duan, Lizhou You, Chengyu Wang, Cen Chen, Ziheng Wu, Weining
Qian, Jun Huang
- Abstract summary: DiffSynth is a novel approach to convert image synthesis pipelines to video synthesis pipelines.
It consists of a latent in-it deflickering framework and a video deflickering algorithm.
One of the notable advantages of Diff Synth is its general applicability to various video synthesis tasks.
- Score: 15.857449277106827
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, diffusion models have emerged as the most powerful approach
in image synthesis. However, applying these models directly to video synthesis
presents challenges, as it often leads to noticeable flickering artifacts.
Although recently proposed zero-shot methods can alleviate flicker to some
extent, we still struggle to generate coherent videos. In this paper, we
propose DiffSynth, a novel approach that aims to convert image synthesis
pipelines to video synthesis pipelines. DiffSynth consists of two key
components: a latent in-iteration deflickering framework and a video
deflickering algorithm. The latent in-iteration deflickering framework applies
video deflickering to the latent space of diffusion models, effectively
preventing flicker accumulation in intermediate steps. Additionally, we propose
a video deflickering algorithm, named the patch blending algorithm, that remaps
objects in different frames and blends them together to enhance video
consistency. One of the notable advantages of DiffSynth is its general
applicability to various video synthesis tasks, including text-guided video
stylization, fashion video synthesis, image-guided video stylization, video
restoration, and 3D rendering. In the task of text-guided video stylization, we
make it possible to synthesize high-quality videos without cherry-picking. The
experimental results demonstrate the effectiveness of DiffSynth. All videos can
be viewed on our project page. Source codes will also be released.
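To make the abstract concrete, the following is a minimal sketch of how in-iteration deflickering could be wired into a sampling loop, assuming a diffusers-style scheduler (with `.timesteps` and `.step()`) and a hypothetical per-frame `denoiser`; the `deflicker` function is a simple neighbor-blending stand-in for the paper's patch blending algorithm, not the released implementation.

```python
# Minimal sketch of latent in-iteration deflickering (assumptions: a
# diffusers-style scheduler and a hypothetical denoiser(latents, t) that
# predicts per-frame noise).
import torch

def deflicker(latents: torch.Tensor, weight: float = 0.5) -> torch.Tensor:
    """Stand-in for the patch blending algorithm: blend each frame's latent
    with the mean of its temporal neighbors to suppress flicker."""
    smoothed = latents.clone()
    smoothed[1:-1] = (1 - weight) * latents[1:-1] \
        + weight * 0.5 * (latents[:-2] + latents[2:])
    return smoothed

@torch.no_grad()
def sample_video(denoiser, scheduler, latents: torch.Tensor) -> torch.Tensor:
    """latents: (num_frames, C, H, W) Gaussian noise. Deflickering is applied
    to the intermediate latents at every step, so flicker cannot accumulate."""
    for t in scheduler.timesteps:
        noise_pred = denoiser(latents, t)                    # per-frame prediction
        latents = scheduler.step(noise_pred, t, latents).prev_sample
        latents = deflicker(latents)                         # in-iteration deflickering
    return latents
```

The point of the design, per the abstract, is that smoothing happens in latent space inside the sampling loop rather than on decoded frames afterwards, so each denoising step starts from already-consistent latents.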
Related papers
- BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models [40.73982918337828]
We propose a training-free, general-purpose video synthesis framework, coined BIVDiff, that bridges specific image diffusion models and general text-to-video foundation diffusion models.
Specifically, we first use a specific image diffusion model (e.g., ControlNet or InstructPix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally feed the inverted latents into the video diffusion model (a high-level outline is sketched below).
arXiv Detail & Related papers (2023-12-05T14:56:55Z)
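For orientation, here is a hedged, high-level transcription of the three stages summarized above; every object and method name (`image_model`, `video_model.invert`, `video_model.denoise`) is a hypothetical placeholder, not the BIVDiff codebase.

```python
# Hypothetical outline of the BIVDiff recipe as summarized above; all calls
# are placeholders standing in for real model invocations.
def bivdiff(prompt, frames, image_model, video_model):
    # 1) Frame-wise generation with a task-specific image diffusion model
    #    (e.g., a ControlNet- or InstructPix2Pix-style model).
    per_frame = [image_model(prompt, frame) for frame in frames]

    # 2) Mixed Inversion: map the generated frames back into the video
    #    diffusion model's latent space (shown as one placeholder call).
    latents = video_model.invert(per_frame)

    # 3) Denoise the inverted latents with the general text-to-video model
    #    to obtain a temporally coherent clip.
    return video_model.denoise(latents, prompt)
```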
- SmoothVideo: Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning [18.979299814757997]
One-shot video tuning methods produce videos marred by incoherence and inconsistency.
This paper introduces a simple yet effective noise constraint across video frames.
Applying this constraint as a loss to existing one-shot video tuning methods significantly improves the overall consistency and smoothness of the generated videos (an illustrative form is sketched below).
arXiv Detail & Related papers (2023-11-29T11:14:43Z)
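As an illustration only, a cross-frame noise constraint of the kind described above could take the shape of a smoothness penalty on per-frame noise predictions; this is an assumed form, not the paper's exact objective.

```python
# Illustrative smoothness penalty on per-frame noise predictions; the exact
# constraint in SmoothVideo may differ.
import torch
import torch.nn.functional as F

def noise_smoothness_loss(noise_pred: torch.Tensor) -> torch.Tensor:
    """noise_pred: (num_frames, C, H, W) noise predicted by the UNet during
    one-shot tuning; penalize abrupt changes between neighboring frames."""
    return F.mse_loss(noise_pred[1:], noise_pred[:-1])

# Added to the standard diffusion objective during tuning, e.g.:
#   loss = F.mse_loss(noise_pred, target_noise) + lam * noise_smoothness_loss(noise_pred)
```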
- FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline [4.295130967329365]
This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model.
The design of our model significantly reduces computational costs compared to other masked frame approaches.
We evaluate different configurations of the MoVQ-based video decoding scheme to improve consistency and achieve better PSNR, SSIM, MSE, and LPIPS scores.
arXiv Detail & Related papers (2023-11-22T00:26:15Z)
- ControlVideo: Training-free Controllable Text-to-Video Generation [117.06302461557044]
ControlVideo is a framework to enable natural and efficient text-to-video generation.
It generates both short and long videos within several minutes using one NVIDIA 2080Ti.
arXiv Detail & Related papers (2023-05-22T14:48:53Z)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- WALDO: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction [82.79642869586587]
WALDO is a novel approach to the prediction of future video frames from past ones.
Individual images are decomposed into multiple layers combining object masks and a small set of control points.
The layer structure is shared across all frames in each video to build dense inter-frame connections (a toy illustration follows this entry).
arXiv Detail & Related papers (2022-11-25T18:59:46Z)
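The layer-plus-control-point decomposition can be pictured with the following self-contained toy sketch; it reduces each layer's motion to a mean control-point shift, which is a deliberate simplification of WALDO's parametric flow prediction, and all names here are illustrative.

```python
# Toy sketch of layered future-frame composition: each object layer carries a
# mask and sparse control points shared across frames; motion is approximated
# here by a per-layer mean translation (a simplification, not WALDO's model).
from dataclasses import dataclass
import numpy as np

@dataclass
class Layer:
    mask: np.ndarray            # (H, W) soft object mask
    control_points: np.ndarray  # (K, 2) sparse points describing the layer's pose

def compose_next_frame(last_frame, layers, predicted_points):
    """Warp each layer by its predicted control-point motion and composite."""
    canvas = last_frame.copy()
    for layer, next_pts in zip(layers, predicted_points):
        # Approximate the layer's motion by the mean control-point displacement.
        dy, dx = np.round((next_pts - layer.control_points).mean(axis=0)).astype(int)
        shifted = np.roll(last_frame, shift=(dy, dx), axis=(0, 1))
        shifted_mask = np.roll(layer.mask, shift=(dy, dx), axis=(0, 1))[..., None]
        canvas = canvas * (1 - shifted_mask) + shifted * shifted_mask
    return canvas
```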
- A Good Image Generator Is What You Need for High-Resolution Video Synthesis [73.82857768949651]
We present a framework that leverages contemporary image generators to render high-resolution videos.
We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator.
We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled.
arXiv Detail & Related papers (2021-04-30T15:38:41Z)
- Strumming to the Beat: Audio-Conditioned Contrastive Video Textures [112.6140796961121]
We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning.
We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order.
Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
arXiv Detail & Related papers (2021-04-06T17:24:57Z)
- Street-view Panoramic Video Synthesis from a Single Satellite Image [92.26826861266784]
We present a novel method for synthesizing both temporally and geometrically consistent street-view panoramic video.
Existing cross-view synthesis approaches focus mainly on images, while video synthesis in this setting has received little attention.
arXiv Detail & Related papers (2020-12-11T20:22:38Z)