Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video
Translation Using Conditional Image Diffusion Models
- URL: http://arxiv.org/abs/2305.19193v1
- Date: Tue, 30 May 2023 16:39:00 GMT
- Authors: Ernie Chu, Shuo-Yen Lin, Jun-Cheng Chen
- Abstract summary: We present an efficient and effective approach for achieving temporally consistent synthetic-to-real video translation in videos of varying lengths.
Our method leverages off-the-shelf conditional image diffusion models, allowing us to perform multiple synthetic-to-real image generations in parallel.
- Score: 15.572275049552255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we present an efficient and effective approach for achieving
temporally consistent synthetic-to-real video translation in videos of varying
lengths. Our method leverages off-the-shelf conditional image diffusion models,
allowing us to perform multiple synthetic-to-real image generations in
parallel. By utilizing the available optical flow information from the
synthetic videos, our approach seamlessly enforces temporal consistency among
corresponding pixels across frames. This is achieved through joint noise
optimization, effectively minimizing spatial and temporal discrepancies. To the
best of our knowledge, our proposed method is the first to accomplish diverse
and temporally consistent synthetic-to-real video translation using conditional
image diffusion models. Furthermore, our approach does not require any training
or fine-tuning of the diffusion models. Extensive experiments conducted on
various benchmarks for synthetic-to-real video translation demonstrate the
effectiveness of our approach, both quantitatively and qualitatively. Finally,
we show that our method outperforms other baseline methods in terms of both
temporal consistency and visual quality.
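The core idea in the abstract — using the synthetic video's optical flow to identify corresponding pixels across frames, then jointly minimizing their discrepancies — can be illustrated with a toy sketch. The following is a simplified NumPy illustration of that joint-optimization objective, operating directly on pixel arrays with integer flow and plain gradient descent; the paper itself optimizes the diffusion noise through the image diffusion model, which is not reproduced here.

```python
import numpy as np

def warp_indices(flow, H, W):
    # Map each pixel (y, x) in frame t to its corresponding location in
    # frame t+1 via the flow field (flow[..., 0] = dx, flow[..., 1] = dy),
    # clipped to the image bounds; integer flow for simplicity.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    ty = np.clip(ys + flow[..., 1], 0, H - 1).astype(int)
    tx = np.clip(xs + flow[..., 0], 0, W - 1).astype(int)
    return ty, tx

def temporal_loss(frames, flows):
    # Sum of squared differences between flow-corresponded pixels
    # in every pair of adjacent frames.
    loss = 0.0
    for t in range(len(frames) - 1):
        H, W = frames[t].shape[:2]
        ty, tx = warp_indices(flows[t], H, W)
        loss += np.sum((frames[t + 1][ty, tx] - frames[t]) ** 2)
    return loss

def optimize_frames(frames, flows, steps=300, lr=0.1):
    # Jointly adjust all frames by gradient descent so that
    # flow-corresponded pixels agree across the whole sequence.
    frames = [f.astype(float).copy() for f in frames]
    for _ in range(steps):
        grads = [np.zeros_like(f) for f in frames]
        for t in range(len(frames) - 1):
            H, W = frames[t].shape[:2]
            ty, tx = warp_indices(flows[t], H, W)
            diff = frames[t + 1][ty, tx] - frames[t]
            grads[t] += -2.0 * diff                      # d(loss)/d(frame_t)
            np.add.at(grads[t + 1], (ty, tx), 2.0 * diff)  # d(loss)/d(frame_{t+1})
        for f, g in zip(frames, grads):
            f -= lr * g
    return frames
```

In the actual method this loss would be backpropagated through the diffusion sampler to the shared noise variables, so all frames are generated in parallel while remaining consistent; the toy version above only conveys the shape of the objective.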
Related papers
- Optical-Flow Guided Prompt Optimization for Coherent Video Generation [51.430833518070145]
We propose a framework called MotionPrompt that guides the video generation process via optical flow.
We optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs.
This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content.
arXiv Detail & Related papers (2024-11-23T12:26:52Z)
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395]
We introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
This enhancement ensures a more consistent transformation of semantically similar content across frames.
Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
arXiv Detail & Related papers (2024-03-19T17:59:18Z)
- Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution [19.748048455806305]
We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach.
We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational costs and super-resolution quality.
arXiv Detail & Related papers (2024-01-18T22:25:16Z)
- Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z)
- AdaDiff: Adaptive Step Selection for Fast Diffusion [88.8198344514677]
We introduce AdaDiff, a framework designed to learn instance-specific step usage policies.
AdaDiff is optimized using a policy gradient method to maximize a carefully designed reward function.
Our approach achieves visual quality comparable to a baseline that uses a fixed 50 denoising steps.
arXiv Detail & Related papers (2023-11-24T11:20:38Z)
- Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion [22.33952368534147]
Text-guided video-to-video stylization transforms the visual appearance of a source video into a different appearance guided by textual prompts.
Existing text-guided image diffusion models can be extended for stylized video synthesis.
We propose a synchronized multi-frame diffusion framework to maintain both the visual details and the temporal consistency.
arXiv Detail & Related papers (2023-11-24T08:38:19Z)
- Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures trained on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z)
- VIDM: Video Implicit Diffusion Models [75.90225524502759]
Diffusion models have emerged as a powerful generative method for synthesizing high-quality and diverse images.
We propose a video generation method based on diffusion models, where the effects of motion are modeled in an implicit condition.
We improve the quality of the generated videos by proposing multiple strategies such as sampling space truncation, robustness penalty, and positional group normalization.
arXiv Detail & Related papers (2022-12-01T02:58:46Z)
- Adaptive Compact Attention For Few-shot Video-to-video Translation [13.535988102579918]
We introduce a novel adaptive compact attention mechanism to efficiently extract contextual features jointly from multiple reference images.
Our core idea is to extract compact basis sets from all the reference images as higher-level representations.
We extensively evaluate our method on a large-scale talking-head video dataset and a human dancing dataset.
arXiv Detail & Related papers (2020-11-30T11:19:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences arising from its use.