Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World
Video Super-Resolution
- URL: http://arxiv.org/abs/2312.06640v1
- Date: Mon, 11 Dec 2023 18:54:52 GMT
- Title: Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World
Video Super-Resolution
- Authors: Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, Chen Change Loy
- Abstract summary: Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE-Decoder to maintain consistency within short sequences; globally, a training-free flow-guided recurrent latent propagation module stabilizes the whole sequence.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
- Score: 65.91317390645163
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based diffusion models have exhibited remarkable success in generation
and editing, showing great promise for enhancing visual content with their
generative prior. However, applying these models to video super-resolution
remains challenging due to the high demands for output fidelity and temporal
consistency, which is complicated by the inherent randomness in diffusion
models. Our study introduces Upscale-A-Video, a text-guided latent diffusion
framework for video upscaling. This framework ensures temporal coherence
through two key mechanisms: locally, it integrates temporal layers into U-Net
and VAE-Decoder, maintaining consistency within short sequences; globally,
without training, a flow-guided recurrent latent propagation module is
introduced to enhance overall video stability by propagating and fusing latents
across the entire sequence. Thanks to the diffusion paradigm, our model also
offers greater flexibility by allowing text prompts to guide texture creation
and adjustable noise levels to balance restoration and generation, enabling a
trade-off between fidelity and quality. Extensive experiments show that
Upscale-A-Video surpasses existing methods in both synthetic and real-world
benchmarks, as well as in AI-generated videos, showcasing impressive visual
realism and temporal consistency.
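To picture the global mechanism, the sketch below shows a minimal, training-free flow-guided recurrent latent propagation in PyTorch: each frame's latent is fused with the previous frame's latent after backward-warping it along optical flow. This is an illustration of the idea only, not the authors' implementation; the function names, the plain convex fusion, and `fuse_weight` are assumptions (the paper additionally handles unreliable flow regions before fusing).

```python
import torch
import torch.nn.functional as F

def warp_with_flow(latent, flow):
    """Backward-warp a latent map (B, C, H, W) with a flow field (B, 2, H, W)."""
    _, _, h, w = latent.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=latent.device),
        torch.arange(w, device=latent.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).float()          # (H, W, 2), pixel coordinates
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)   # displace each pixel by its flow
    # Normalize to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[..., 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(latent, grid, align_corners=True)

def propagate_latents(latents, flows, fuse_weight=0.5):
    """Recurrently propagate per-frame latents forward through the sequence.

    latents: list of (B, C, H, W) frame latents.
    flows:   list of (B, 2, H, W) backward flows that map frame t's pixels to
             their locations in frame t-1 (at latent resolution).
    fuse_weight: assumed blending strength for the warped history.
    """
    out = [latents[0]]
    for t in range(1, len(latents)):
        warped_prev = warp_with_flow(out[t - 1], flows[t - 1])
        # Simple convex fusion; the paper additionally masks occluded or
        # unreliable flow regions before fusing.
        fused = (1.0 - fuse_weight) * latents[t] + fuse_weight * warped_prev
        out.append(fused)
    return out
```

Because this step requires no training, it can be applied on top of the temporal layers that handle short-range consistency.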
Related papers
- Exploring Representation-Aligned Latent Space for Better Generation [86.45670422239317]
We introduce ReaLS, which integrates semantic priors to improve generation performance.
We show that fundamental DiT and SiT models trained on ReaLS achieve a 15% improvement in the FID metric.
The enhanced semantic latent space enables more perceptual downstream tasks, such as segmentation and depth estimation.
arXiv Detail & Related papers (2025-02-01T07:42:12Z)
- DiffVSR: Enhancing Real-World Video Super-Resolution with Diffusion Models for Advanced Visual Quality and Temporal Consistency [25.756755602342942]
We present DiffVSR, a diffusion-based framework for real-world video super-resolution.
For intra-sequence coherence, we develop a multi-scale temporal attention module and temporal-enhanced VAE decoder.
We propose a progressive learning strategy that transitions from simple to complex degradations, enabling robust optimization.
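As a rough illustration of temporal attention, the block below attends across the frame axis at every spatial position. It is a generic single-scale sketch, not DiffVSR's multi-scale module; the class name and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Minimal temporal self-attention: every spatial position attends across frames."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, C, H, W) video features
        b, t, c, h, w = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)   # one sequence per pixel
        normed = self.norm(tokens)
        attn_out, _ = self.attn(normed, normed, normed)
        tokens = tokens + attn_out                                    # residual connection
        return tokens.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)   # back to (B, T, C, H, W)
```

In practice such a block would be interleaved with the spatial layers of a U-Net or VAE decoder so per-frame features can exchange information over time.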
arXiv Detail & Related papers (2025-01-17T10:53:03Z)
- DiffuEraser: A Diffusion Model for Video Inpainting [13.292164408616257]
We introduce DiffuEraser, a video inpainting model based on stable diffusion, to fill masked regions with greater details and more coherent structures.
We also expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of Video Diffusion Models.
arXiv Detail & Related papers (2025-01-17T08:03:02Z)
- Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion [22.988212617368095]
We propose GLC-Diffusion, a tuning-free method for long video generation.
It models the long video denoising process by establishing Global-Local Collaborative Denoising.
We also propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses.
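The summary suggests VMCR guides refinement with gradients of pixel-domain and frequency-domain consistency losses. Below is a minimal sketch of that general recipe under assumed loss definitions (frame differences and an FFT-magnitude term); it is not the paper's actual module, and `step_size` and `freq_weight` are illustrative values.

```python
import torch

def motion_consistency_losses(frames):
    """frames: (B, T, C, H, W) decoded video. Returns pixel-domain and
    frequency-domain consistency losses between neighbouring frames."""
    diff = frames[:, 1:] - frames[:, :-1]                  # temporal differences
    pixel_loss = diff.abs().mean()                         # pixel-wise smoothness
    spec = torch.fft.rfft2(frames, dim=(-2, -1))           # per-frame 2D spectrum
    freq_loss = (spec[:, 1:] - spec[:, :-1]).abs().mean()  # frequency-wise smoothness
    return pixel_loss, freq_loss

def vmcr_gradient_step(frames, step_size=0.1, freq_weight=0.5):
    """One illustrative refinement step: descend the gradient of the combined
    pixel/frequency losses with respect to the frames."""
    frames = frames.detach().requires_grad_(True)
    pixel_loss, freq_loss = motion_consistency_losses(frames)
    total = pixel_loss + freq_weight * freq_loss
    (grad,) = torch.autograd.grad(total, frames)
    return (frames - step_size * grad).detach()
```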
arXiv Detail & Related papers (2025-01-08T05:49:39Z)
- STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution [42.859188375578604]
Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods.
These models struggle to maintain temporal consistency, as they are trained on static images.
We introduce a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency.
arXiv Detail & Related papers (2025-01-06T12:36:21Z)
- VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide [48.22321420680046]
VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models.
It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process.
The proposed method brings about significant improvement in temporal consistency and image fidelity.
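A minimal sketch of the interpolation idea, assuming epsilon-prediction models and a diffusers-style scheduler exposing `alphas_cumprod` and `step(...).prev_sample`; `guide_weight` and the exact blending point are assumptions rather than the paper's precise procedure.

```python
import torch

def guided_denoise_step(x_t, t, sampling_model, guiding_model, scheduler, guide_weight=0.3):
    """One illustrative denoising step in which the guiding model's clean-sample
    estimate is interpolated into the sampling model's estimate before the
    scheduler update."""
    with torch.no_grad():
        eps_sample = sampling_model(x_t, t)   # noise prediction of the model being sampled
        eps_guide = guiding_model(x_t, t)     # noise prediction of the temporally stable teacher

    alpha_bar = scheduler.alphas_cumprod[t].to(x_t.device)
    # Convert noise predictions to clean-sample (x0) estimates.
    x0_sample = (x_t - (1 - alpha_bar).sqrt() * eps_sample) / alpha_bar.sqrt()
    x0_guide = (x_t - (1 - alpha_bar).sqrt() * eps_guide) / alpha_bar.sqrt()

    # Interpolate the guiding model's denoised sample into the sampling path.
    x0_mixed = (1 - guide_weight) * x0_sample + guide_weight * x0_guide

    # Re-derive the noise consistent with the mixed x0 and take the scheduler step.
    eps_mixed = (x_t - alpha_bar.sqrt() * x0_mixed) / (1 - alpha_bar).sqrt()
    return scheduler.step(eps_mixed, t, x_t).prev_sample
```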
arXiv Detail & Related papers (2024-10-06T05:46:17Z)
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395]
We introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
This enhancement ensures a more consistent transformation of semantically similar content across frames.
Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
arXiv Detail & Related papers (2024-03-19T17:59:18Z)
- Lumiere: A Space-Time Diffusion Model for Video Generation [75.54967294846686]
We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once.
This is in contrast to existing video models, which synthesize distant keyframes followed by temporal super-resolution.
By deploying both spatial and (importantly) temporal down- and up-sampling, our model learns to directly generate a full-frame-rate, low-resolution video.
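To illustrate joint space-time down- and up-sampling, here is a toy pair of 3D-convolution blocks that halve and then restore both the temporal and spatial resolution; it is a stand-in for the idea, not Lumiere's Space-Time U-Net.

```python
import torch
import torch.nn as nn

class SpaceTimeDown(nn.Module):
    """Downsample both the temporal and spatial axes with a strided 3D conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=(2, 2, 2), padding=1)

    def forward(self, x):          # x: (B, C, T, H, W)
        return self.conv(x)        # -> (B, out_ch, T/2, H/2, W/2)

class SpaceTimeUp(nn.Module):
    """Upsample time and space back with a transposed 3D conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=4, stride=(2, 2, 2), padding=1)

    def forward(self, x):
        return self.conv(x)        # -> (B, out_ch, 2T, 2H, 2W)

# Example: a 16-frame 64x64 clip is squeezed to 8x32x32 and restored.
x = torch.randn(1, 8, 16, 64, 64)
y = SpaceTimeUp(16, 8)(SpaceTimeDown(8, 16)(x))
assert y.shape == x.shape
```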
arXiv Detail & Related papers (2024-01-23T18:05:25Z)
- Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution [15.197746480157651]
We propose an effective real-world VSR algorithm by leveraging the strength of pre-trained latent diffusion models.
We exploit the temporal dynamics in LR videos to guide the diffusion process by optimizing the latent sampling path with a motion-guided loss.
The proposed motion-guided latent diffusion based VSR algorithm achieves significantly better perceptual quality than state-of-the-art methods on real-world VSR benchmark datasets.
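A minimal picture of motion-guided sampling, under assumptions: at a given diffusion step, the latent of frame t is nudged down the gradient of an L1 loss against the flow-warped latent of frame t-1. The loss form, `step_size`, and where this is applied in the sampling loop are illustrative choices, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def motion_guided_update(z_t, z_prev_warped, step_size=0.05):
    """One illustrative guidance step on the latent sampling path.

    z_t:           (B, C, H, W) latent of frame t at the current sampling step.
    z_prev_warped: (B, C, H, W) latent of frame t-1 backward-warped into frame t
                   using optical flow estimated from the LR video.
    step_size:     assumed guidance strength.
    """
    z_t = z_t.detach().requires_grad_(True)
    motion_loss = F.l1_loss(z_t, z_prev_warped)       # motion-guided loss
    (grad,) = torch.autograd.grad(motion_loss, z_t)
    return (z_t - step_size * grad).detach()
```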
arXiv Detail & Related papers (2023-12-01T14:40:07Z)
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.