MeDM: Mediating Image Diffusion Models for Video-to-Video Translation
with Temporal Correspondence Guidance
- URL: http://arxiv.org/abs/2308.10079v3
- Date: Wed, 20 Dec 2023 08:49:59 GMT
- Title: MeDM: Mediating Image Diffusion Models for Video-to-Video Translation
with Temporal Correspondence Guidance
- Authors: Ernie Chu, Tzuhsuan Huang, Shuo-Yen Lin, Jun-Cheng Chen
- Abstract summary: This study introduces an efficient and effective method, MeDM, that utilizes pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow.
We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores.
- Score: 10.457759140533168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study introduces an efficient and effective method, MeDM, that utilizes
pre-trained image Diffusion Models for video-to-video translation with
consistent temporal flow. The proposed framework can render videos from scene
position information, such as a normal G-buffer, or perform text-guided editing
on videos captured in real-world scenarios. We employ explicit optical flows to
construct a practical coding that enforces physical constraints on generated
frames and mediates independent frame-wise scores. By leveraging this coding,
maintaining temporal consistency in the generated videos can be framed as an
optimization problem with a closed-form solution. To ensure compatibility with
Stable Diffusion, we also suggest a workaround for modifying observation-space
scores in latent Diffusion Models. Notably, MeDM does not require fine-tuning
or test-time optimization of the Diffusion Models. Through extensive
qualitative, quantitative, and subjective experiments on various benchmarks,
the study demonstrates the effectiveness and superiority of the proposed
approach. Our project page can be found at https://medm2023.github.io
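The abstract frames temporal consistency as an optimization problem with a closed-form solution over a flow-derived pixel coding. As a minimal sketch of one least-squares reading of that claim, the snippet below averages all pixels that optical flow links to the same scene point; the flow-based grouping (group_ids), the latent-space workaround, and MeDM's actual formulation are assumptions that are not shown or specified here.

    # Minimal sketch: pixels linked by optical flow are treated as observations of the
    # same scene point, and the least-squares solution under that equality constraint
    # is their mean. The grouping into flow-tracked codes is assumed to be precomputed.
    import numpy as np

    def enforce_temporal_consistency(frames: np.ndarray, group_ids: np.ndarray) -> np.ndarray:
        """frames: (T, H, W, C) frame-wise outputs; group_ids: (T, H, W) integer labels,
        where pixels sharing a label are linked by optical flow."""
        T, H, W, C = frames.shape
        flat_vals = frames.reshape(-1, C)          # (T*H*W, C)
        flat_ids = group_ids.reshape(-1)           # (T*H*W,)
        n_groups = int(flat_ids.max()) + 1

        # Accumulate per-group sums and counts, then average (closed-form least squares).
        sums = np.zeros((n_groups, C), dtype=np.float64)
        counts = np.zeros(n_groups, dtype=np.float64)
        np.add.at(sums, flat_ids, flat_vals)
        np.add.at(counts, flat_ids, 1.0)
        means = sums / np.maximum(counts, 1.0)[:, None]

        # Broadcast each group's mean back to all of its member pixels.
        return means[flat_ids].reshape(T, H, W, C).astype(frames.dtype)

Averaging flow-linked pixels minimizes their squared disagreement, which is one way a consistency objective of this kind can admit a direct, closed-form solution.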
Related papers
- Optical-Flow Guided Prompt Optimization for Coherent Video Generation [51.430833518070145] (arXiv, 2024-11-23)
We propose a framework called MotionPrompt that guides the video generation process via optical flow.
We optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs.
This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content.
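As a rough, hypothetical illustration of the loop described above, the sketch below performs one reverse-sampling step's worth of prompt optimization: a discriminator scores a random pair of predicted frames and its gradient updates the learnable token embeddings. The denoiser, decoder, and flow-trained discriminator interfaces are placeholders, not MotionPrompt's actual API.

    # One hypothetical prompt-optimization step; all model interfaces are stand-ins.
    import torch

    def optimize_prompt_step(tokens, latents, t, denoiser, decode, discriminator, lr=1e-3):
        """tokens: (L, D) learnable embeddings; latents: (T, C, H, W) per-frame latents."""
        tokens = tokens.detach().requires_grad_(True)
        x0_pred = denoiser(latents, t, tokens)         # predicted clean latents per frame
        frames = decode(x0_pred)                       # decoded frames, (T, 3, H', W')

        # Score the motion plausibility of a random frame pair with the trained discriminator.
        i, j = torch.randperm(frames.shape[0])[:2]
        realism = discriminator(frames[i], frames[j])  # higher = more natural motion
        loss = -realism.mean()

        loss.backward()
        with torch.no_grad():
            tokens -= lr * tokens.grad                 # one gradient step on the embeddings
        return tokens.detach()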
- VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide [48.22321420680046] (arXiv, 2024-10-06)
VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models.
It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process.
The proposed method significantly improves temporal consistency and image fidelity.
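A minimal sketch of that interpolation idea follows, assuming both models expose a denoised-sample (x0) prediction; the interface names and the fixed blending weight are illustrative, not VideoGuide's actual implementation.

    # Blend the guiding model's denoised estimate into the sampling model's estimate.
    import torch

    @torch.no_grad()
    def guided_denoised_estimate(x_t, t, sampler_model, guide_model, alpha=0.3):
        """Return a temporally steadier x0 estimate at noise level t."""
        x0_sample = sampler_model.predict_x0(x_t, t)   # pretrained T2V model being sampled
        x0_guide = guide_model.predict_x0(x_t, t)      # temporally consistent guiding model
        return (1.0 - alpha) * x0_sample + alpha * x0_guide

In practice a scheme like this would typically be applied only during the early denoising steps, after which the sampling model proceeds on its own.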
- ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844] (arXiv, 2024-06-03)
We propose a training-free method that adapts pretrained generative video models to high-frame-rate generation in a plug-and-play manner.
We transform a video diffusion model into a self-cascaded video diffusion model with hidden state correction modules.
Our training-free method is even comparable to trained models supported by huge compute resources and large-scale datasets.
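The summary does not spell out the correction itself, so the following is only a loose, assumed illustration of a training-free hidden-state correction: states of newly inserted in-between frames are pulled toward the interpolation of their neighboring key frames' states.

    # Assumed, simplified hidden-state correction for inserted in-between frames.
    import torch

    def correct_hidden_states(h, key_idx, strength=0.5):
        """h: (T, N, D) per-frame hidden states; key_idx: sorted indices of key frames."""
        h = h.clone()
        for a, b in zip(key_idx[:-1], key_idx[1:]):
            for t in range(a + 1, b):
                w = (t - a) / (b - a)                  # linear interpolation weight
                target = (1 - w) * h[a] + w * h[b]     # state implied by the two key frames
                h[t] = (1 - strength) * h[t] + strength * target
        return h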
- Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models [96.97910688908956] (arXiv, 2024-05-27)
We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models.
We propose a framework tailored for VSS based on pre-trained image and video diffusion models.
Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches.
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395] (arXiv, 2024-03-19)
We introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
This enhancement ensures a more consistent transformation of semantically similar content across frames.
Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
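One hedged interpretation of that explicit feature update is sketched below: decoder features are optimized for a few steps against an inter-frame term (flow-linked positions should agree) and an intra-frame term (self-similar positions should stay coherent). The correspondence inputs, loss form, and weights are illustrative assumptions rather than FRESCO's actual objective.

    # Gradient-based feature update toward spatial-temporal consistency (illustrative).
    import torch

    def update_features(feat, warp_to_prev, self_sim, steps=5, lr=0.1, w_intra=0.5):
        """feat: (T, C, H, W) features; warp_to_prev(feat, t) warps frame t's features to
        frame t-1 using optical flow; self_sim: (T, HW, HW) intra-frame affinities."""
        feat = feat.detach().clone().requires_grad_(True)
        opt = torch.optim.Adam([feat], lr=lr)
        T, C, H, W = feat.shape
        for _ in range(steps):
            # Inter-frame: warped features should match the previous frame's features.
            inter = sum(((warp_to_prev(feat, t) - feat[t - 1]) ** 2).mean() for t in range(1, T))
            # Intra-frame: features should agree with what their similar pixels predict.
            flat = feat.reshape(T, C, H * W)
            propagated = torch.einsum('tnm,tcm->tcn', self_sim, flat)
            intra = ((flat - propagated) ** 2).mean()
            loss = inter + w_intra * intra
            opt.zero_grad()
            loss.backward()
            opt.step()
        return feat.detach()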
- Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution [19.748048455806305] (arXiv, 2024-01-18)
We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach.
We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational costs and super-resolution quality.
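"Inflation" here usually means lifting pretrained 2D layers into temporal 3D ones before tuning; the sketch below shows one common centre-initialised variant for a single convolution, which may differ from the inflation scheme actually used in the paper.

    # Copy a pretrained 2D kernel into the centre slice of a 3D kernel so the inflated
    # layer initially behaves identically on every frame (common practice; assumed here).
    import torch
    import torch.nn as nn

    def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
        conv3d = nn.Conv3d(
            conv2d.in_channels, conv2d.out_channels,
            kernel_size=(time_kernel, *conv2d.kernel_size),
            stride=(1, *conv2d.stride),
            padding=(time_kernel // 2, *conv2d.padding),
            groups=conv2d.groups,
            bias=conv2d.bias is not None,
        )
        with torch.no_grad():
            conv3d.weight.zero_()
            conv3d.weight[:, :, time_kernel // 2] = conv2d.weight  # centre-frame init
            if conv2d.bias is not None:
                conv3d.bias.copy_(conv2d.bias)
        return conv3d

With this initialisation the inflated network reproduces the image model frame by frame, so any tuning budget is spent only on learning temporal behaviour.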
- VIDM: Video Implicit Diffusion Models [75.90225524502759] (arXiv, 2022-12-01)
Diffusion models have emerged as a powerful generative method for synthesizing high-quality and diverse images.
We propose a video generation method based on diffusion models, where the effects of motion are modeled in an implicit condition.
We improve the quality of the generated videos by proposing multiple strategies such as sampling space truncation, robustness penalty, and positional group normalization.
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963] (arXiv, 2020-02-26)
In this work, we perform efficient semantic video segmentation in a per-frame fashion at inference time.
We employ compact models for real-time execution and design new knowledge distillation methods to narrow the performance gap between compact and large models.
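The summary does not detail those distillation methods, so the sketch below only shows a generic per-frame distillation objective (soft-label KL plus hard-label cross-entropy) that a compact segmentation model could be trained with; the paper's actual losses may differ.

    # Generic segmentation distillation loss: soften both logits, match distributions,
    # and keep a standard cross-entropy term against the ground-truth labels.
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        """student_logits/teacher_logits: (B, K, H, W); labels: (B, H, W) class indices."""
        kd = F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(teacher_logits.detach() / T, dim=1),
            reduction="batchmean",
        ) * (T * T)
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1.0 - alpha) * ce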
This list is automatically generated from the titles and abstracts of the papers on this site.