MeDM: Mediating Image Diffusion Models for Video-to-Video Translation
with Temporal Correspondence Guidance
- URL: http://arxiv.org/abs/2308.10079v3
- Date: Wed, 20 Dec 2023 08:49:59 GMT
- Title: MeDM: Mediating Image Diffusion Models for Video-to-Video Translation
with Temporal Correspondence Guidance
- Authors: Ernie Chu, Tzuhsuan Huang, Shuo-Yen Lin, Jun-Cheng Chen
- Abstract summary: This study introduces an efficient and effective method, MeDM, that utilizes pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow.
We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores.
- Score: 10.457759140533168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study introduces an efficient and effective method, MeDM, that utilizes
pre-trained image Diffusion Models for video-to-video translation with
consistent temporal flow. The proposed framework can render videos from scene
position information, such as a normal G-buffer, or perform text-guided editing
on videos captured in real-world scenarios. We employ explicit optical flows to
construct a practical coding that enforces physical constraints on generated
frames and mediates independent frame-wise scores. By leveraging this coding,
maintaining temporal consistency in the generated videos can be framed as an
optimization problem with a closed-form solution. To ensure compatibility with
Stable Diffusion, we also suggest a workaround for modifying observation-space
scores in latent Diffusion Models. Notably, MeDM does not require fine-tuning
or test-time optimization of the Diffusion Models. Through extensive
qualitative, quantitative, and subjective experiments on various benchmarks,
the study demonstrates the effectiveness and superiority of the proposed
approach. Our project page can be found at https://medm2023.github.io
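The abstract frames flow-based temporal consistency as an optimization problem with a closed-form solution. The code below is only a minimal sketch of that general idea, not the paper's implementation: it assumes the optical-flow correspondences have already been collapsed into an integer label map in which pixels on the same flow trajectory share a label, and it applies the closed-form least-squares projection, replacing every pixel with the mean of its trajectory. The function name, the label-map encoding, and the toy data are illustrative assumptions; MeDM's actual coding and its workaround for latent-space Stable Diffusion scores are described in the paper.

```python
import numpy as np

def flow_consistent_projection(frames, labels):
    """Hypothetical illustration of a closed-form consistency step.

    frames : (T, H, W, C) float array of independent frame-wise estimates.
    labels : (T, H, W) int array; pixels sharing a label are assumed to lie
             on the same optical-flow trajectory (an assumed "coding").

    Minimizing the squared deviation from the frame-wise estimates subject to
    equality along each trajectory is solved in closed form by replacing each
    pixel with its trajectory mean.
    """
    T, H, W, C = frames.shape
    flat = frames.reshape(-1, C)
    ids = labels.reshape(-1)

    # Per-trajectory sums and counts, then means.
    n = ids.max() + 1
    sums = np.zeros((n, C))
    np.add.at(sums, ids, flat)
    counts = np.bincount(ids, minlength=n).reshape(-1, 1)
    means = sums / np.maximum(counts, 1)

    # Closed-form projection: write each trajectory mean back to its pixels.
    return means[ids].reshape(T, H, W, C)


# Toy usage: 2 frames of 2x2 pixels, 1 channel; pixel (0,0) of frame 0 is
# assumed to correspond to pixel (0,1) of frame 1 under the flow (label 0).
frames = np.arange(8, dtype=float).reshape(2, 2, 2, 1)
labels = np.array([[[0, 1], [2, 3]],
                   [[4, 0], [5, 6]]])
print(flow_consistent_projection(frames, labels)[..., 0])
```

In a full pipeline, such a projection would be applied to the independent frame-wise estimates produced during sampling; refer to the paper and project page for the method as actually proposed.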
Related papers
- VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide [48.22321420680046]
VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models.
It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process.
The proposed method brings about significant improvement in temporal consistency and image fidelity.
arXiv Detail & Related papers (2024-10-06T05:46:17Z)
- ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free, plug-and-play method that adapts generative video models for high frame rate video generation.
We transform a video model into a self-cascaded video diffusion model with the designed hidden state correction modules.
Our training-free method is even comparable to trained models supported by huge compute resources and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z)
- Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models [96.97910688908956]
We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models.
We propose a framework tailored for VSS based on pre-trained image and video diffusion models.
Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches.
arXiv Detail & Related papers (2024-05-27T08:39:38Z)
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395]
We introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
This enhancement ensures a more consistent transformation of semantically similar content across frames.
Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
arXiv Detail & Related papers (2024-03-19T17:59:18Z)
- Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution [19.748048455806305]
We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach.
We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational costs and super-resolution quality.
arXiv Detail & Related papers (2024-01-18T22:25:16Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- VIDM: Video Implicit Diffusion Models [75.90225524502759]
Diffusion models have emerged as a powerful generative method for synthesizing high-quality and diverse images.
We propose a video generation method based on diffusion models, where the effects of motion are modeled with an implicit condition.
We improve the quality of the generated videos by proposing multiple strategies such as sampling space truncation, robustness penalty, and positional group normalization.
arXiv Detail & Related papers (2022-12-01T02:58:46Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform efficient semantic video segmentation in a per-frame fashion during inference.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)