Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
- URL: http://arxiv.org/abs/2306.07954v2
- Date: Sun, 17 Sep 2023 09:57:20 GMT
- Title: Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
- Authors: Shuai Yang, Yifan Zhou, Ziwei Liu and Chen Change Loy
- Abstract summary: This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
- Score: 93.18163456287164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large text-to-image diffusion models have exhibited impressive proficiency in
generating high-quality images. However, when applying these models to the video
domain, ensuring temporal consistency across video frames remains a formidable
challenge. This paper proposes a novel zero-shot text-guided video-to-video
translation framework to adapt image models to videos. The framework includes
two parts: key frame translation and full video translation. The first part
uses an adapted diffusion model to generate key frames, with hierarchical
cross-frame constraints applied to enforce coherence in shapes, textures and
colors. The second part propagates the key frames to other frames with
temporal-aware patch matching and frame blending. Our framework achieves global
style and local texture temporal consistency at a low cost (without re-training
or optimization). The adaptation is compatible with existing image diffusion
techniques, allowing our framework to take advantage of them, for example by
customizing a specific subject with LoRA or introducing extra spatial guidance
with ControlNet. Extensive experimental results demonstrate the
effectiveness of our proposed framework over existing methods in rendering
high-quality and temporally-coherent videos.
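The two-part structure described in the abstract (key frame translation, then propagation to the remaining frames) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `translate_key_frame` is a hypothetical placeholder for the adapted diffusion model with hierarchical cross-frame constraints, and the propagation step uses a simple linear blend of the two nearest translated key frames, whereas the paper uses temporal-aware patch matching and frame blending.

```python
from typing import Optional

import numpy as np


def translate_key_frame(frame: np.ndarray, prev_translated: Optional[np.ndarray]) -> np.ndarray:
    """Hypothetical stand-in for the adapted image diffusion model.

    In the paper, this step applies hierarchical cross-frame constraints so that
    shapes, textures and colors stay coherent with previously translated key
    frames; here it simply returns the frame unchanged as a placeholder.
    """
    return frame.copy()


def rerender_video(frames: list, key_interval: int = 10) -> list:
    """Two-part pipeline: translate key frames, then propagate to the rest."""
    n = len(frames)
    key_ids = list(range(0, n, key_interval))
    if key_ids[-1] != n - 1:
        key_ids.append(n - 1)

    # Part 1: key frame translation (each key frame conditions on the previous result).
    translated, prev = {}, None
    for k in key_ids:
        translated[k] = translate_key_frame(frames[k], prev)
        prev = translated[k]

    # Part 2: propagate translated key frames to the remaining frames.
    # A linear blend of the two nearest translated key frames stands in for the
    # paper's temporal-aware patch matching and frame blending.
    output = []
    for i in range(n):
        left = max(k for k in key_ids if k <= i)
        right = min(k for k in key_ids if k >= i)
        if left == right:
            output.append(translated[left])
            continue
        w = (i - left) / (right - left)
        blend = (1.0 - w) * translated[left].astype(np.float64) + w * translated[right].astype(np.float64)
        output.append(blend.astype(frames[i].dtype))
    return output


if __name__ == "__main__":
    # Toy example: 25 random RGB frames of size 64x64.
    video = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(25)]
    result = rerender_video(video, key_interval=10)
    print(len(result), result[0].shape)
```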
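The abstract also notes compatibility with existing image diffusion tooling such as LoRA and ControlNet. As a rough illustration of what that looks like in practice with the `diffusers` library, the snippet below builds a ControlNet-conditioned Stable Diffusion pipeline and loads a LoRA; the model IDs, the LoRA path, the input file name, and the use of Canny edges as the spatial condition are assumptions for illustration, not the paper's exact setup.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Assumed model IDs; the paper's actual checkpoints may differ.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Customize a specific subject with a LoRA (hypothetical weight path).
pipe.load_lora_weights("path/to/subject_lora")

# Extra spatial guidance: Canny edges extracted from a video key frame (hypothetical file).
key_frame = np.array(Image.open("key_frame.png").convert("RGB"))
edges = cv2.Canny(key_frame, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

result = pipe(
    "a watercolor painting of the scene",
    image=control_image,
    num_inference_steps=20,
).images[0]
result.save("translated_key_frame.png")
```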
Related papers
- LoopAnimate: Loopable Salient Object Animation [19.761865029125524]
LoopAnimate is a novel method for generating videos with consistent start and end frames.
It achieves state-of-the-art performance in both objective metrics, such as fidelity and temporal consistency, and subjective evaluation results.
arXiv Detail & Related papers (2024-04-14T07:36:18Z)
- Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval [58.17315970207874]
Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships.
Existing methods resort to joint training on both source and target domain videos for cross-domain applications.
We explore generative video diffusion for fine-grained editing of source videos controlled by the target sentences.
arXiv Detail & Related papers (2024-01-24T09:45:40Z)
- VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations.
arXiv Detail & Related papers (2023-12-17T09:05:56Z)
- Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion [22.33952368534147]
Text-guided video-to-video stylization transforms the visual appearance of a source video to a different appearance guided by textual prompts.
Existing text-guided image diffusion models can be extended for stylized video synthesis.
We propose a synchronized multi-frame diffusion framework to maintain both the visual details and the temporal consistency.
arXiv Detail & Related papers (2023-11-24T08:38:19Z)
- FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline [4.295130967329365]
This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model.
The design of our model significantly reduces computational costs compared to other masked frame approaches.
We evaluate different configurations of the MoVQ-based video decoding scheme to improve consistency and achieve better PSNR, SSIM, MSE, and LPIPS scores.
arXiv Detail & Related papers (2023-11-22T00:26:15Z)
- Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM.
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z)
- ControlVideo: Training-free Controllable Text-to-Video Generation [117.06302461557044]
ControlVideo is a framework to enable natural and efficient text-to-video generation.
It generates both short and long videos within several minutes using one NVIDIA 2080Ti.
arXiv Detail & Related papers (2023-05-22T14:48:53Z)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- ALANET: Adaptive Latent Attention Network for Joint Video Deblurring and Interpolation [38.52446103418748]
We introduce a novel architecture, Adaptive Latent Attention Network (ALANET), which synthesizes sharp high frame-rate videos.
We employ a combination of self-attention and cross-attention modules between consecutive frames in the latent space to generate an optimized representation for each frame.
Our method performs favorably against various state-of-the-art approaches, even though we tackle a much more difficult problem.
arXiv Detail & Related papers (2020-08-31T21:11:53Z)