SmoothVideo: Smooth Video Synthesis with Noise Constraints on Diffusion
Models for One-shot Video Tuning
- URL: http://arxiv.org/abs/2311.17536v2
- Date: Tue, 6 Feb 2024 12:01:47 GMT
- Title: SmoothVideo: Smooth Video Synthesis with Noise Constraints on Diffusion
Models for One-shot Video Tuning
- Authors: Liang Peng, Haoran Cheng, Zheng Yang, Ruisi Zhao, Linxuan Xia,
Chaotian Song, Qinglin Lu, Boxi Wu, Wei Liu
- Abstract summary: Existing one-shot video tuning methods often produce videos marred by incoherence and inconsistency.
This paper introduces a simple yet effective noise constraint across video frames.
By applying the loss to existing one-shot video tuning methods, we significantly improve the overall consistency and smoothness of the generated videos.
- Score: 18.979299814757997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent one-shot video tuning methods, which fine-tune the network on a
specific video based on pre-trained text-to-image models (e.g., Stable
Diffusion), are popular in the community because of their flexibility. However,
these methods often produce videos marred by incoherence and inconsistency. To
address these limitations, this paper introduces a simple yet effective noise
constraint across video frames. This constraint aims to regulate noise
predictions across their temporal neighbors, resulting in smooth latents. It
can be simply included as a loss term during the training phase. By applying
the loss to existing one-shot video tuning methods, we significantly improve
the overall consistency and smoothness of the generated videos. Furthermore, we
argue that current video evaluation metrics inadequately capture smoothness. To
address this, we introduce a novel metric that considers detailed features and
their temporal dynamics. Experimental results validate the effectiveness of our
approach in producing smoother videos on various one-shot video tuning
baselines. The source codes and video demos are available at
https://github.com/SPengLiang/SmoothVideo.
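The noise constraint described in the abstract amounts to an extra penalty on the differences between noise predictions of temporally neighboring frames, added to the usual diffusion training loss. The sketch below illustrates one way such a term could look; the tensor layout, the L2 penalty, and the weight `lambda_smooth` are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Minimal, illustrative sketch of a temporal noise-consistency loss for
# one-shot video tuning. Assumption: `pred_noise` holds the model's noise
# predictions for consecutive frames of one video, shape (F, C, H, W).
import torch


def temporal_noise_consistency_loss(pred_noise: torch.Tensor,
                                    lambda_smooth: float = 0.1) -> torch.Tensor:
    """Penalize differences between noise predictions of temporal neighbors."""
    neighbor_diff = pred_noise[1:] - pred_noise[:-1]  # (F-1, C, H, W)
    return lambda_smooth * neighbor_diff.pow(2).mean()


# Hypothetical use inside a standard diffusion training step:
#   base_loss = torch.nn.functional.mse_loss(pred_noise, target_noise)
#   loss = base_loss + temporal_noise_consistency_loss(pred_noise)
#   loss.backward()
```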
Related papers
- Fine-gained Zero-shot Video Sampling [21.42513407755273]
We propose a novel Zero-Shot video sampling algorithm, denoted as $\mathcal{ZS}^2$.
$\mathcal{ZS}^2$ is capable of directly sampling high-quality video clips without any training or optimization.
It achieves state-of-the-art performance in zero-shot video generation, occasionally outperforming recent supervised methods.
arXiv Detail & Related papers (2024-07-31T09:36:58Z)
- CoNo: Consistency Noise Injection for Tuning-free Long Video Diffusion [15.013908857230966]
"Look-back" mechanism enhances the fine-grained scene transition between different video clips.
Long-term consistency regularization focuses on explicitly minimizing the pixel-wise distance between the predicted noises of the extended video clip and the original one.
Experiments have shown the effectiveness of the strategies by performing long-video generation under both single- and multi-text prompt conditions.
arXiv Detail & Related papers (2024-06-07T16:56:42Z)
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z)
- FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling [85.60543452539076]
Existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference.
This study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts.
We propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models.
arXiv Detail & Related papers (2023-10-23T17:59:58Z)
- DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis [15.857449277106827]
DiffSynth is a novel approach to convert image synthesis pipelines to video synthesis pipelines.
It consists of a latent in-iteration deflickering framework and a video deflickering algorithm.
One of the notable advantages of DiffSynth is its general applicability to various video synthesis tasks.
arXiv Detail & Related papers (2023-08-07T10:41:52Z)
- ControlVideo: Training-free Controllable Text-to-Video Generation [117.06302461557044]
ControlVideo is a framework to enable natural and efficient text-to-video generation.
It generates both short and long videos within several minutes using one NVIDIA 2080Ti.
arXiv Detail & Related papers (2023-05-22T14:48:53Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- Strumming to the Beat: Audio-Conditioned Contrastive Video Textures [112.6140796961121]
We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning.
We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order.
Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
arXiv Detail & Related papers (2021-04-06T17:24:57Z)