Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World
Video Super-Resolution
- URL: http://arxiv.org/abs/2312.06640v1
- Date: Mon, 11 Dec 2023 18:54:52 GMT
- Title: Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World
Video Super-Resolution
- Authors: Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, Chen Change Loy
- Abstract summary: Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE-Decoder to maintain consistency within short sequences; globally, a training-free flow-guided recurrent latent propagation module enhances stability across the whole video.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
- Score: 65.91317390645163
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based diffusion models have exhibited remarkable success in generation
and editing, showing great promise for enhancing visual content with their
generative prior. However, applying these models to video super-resolution
remains challenging due to the high demands for output fidelity and temporal
consistency, which is complicated by the inherent randomness in diffusion
models. Our study introduces Upscale-A-Video, a text-guided latent diffusion
framework for video upscaling. This framework ensures temporal coherence
through two key mechanisms: locally, it integrates temporal layers into U-Net
and VAE-Decoder, maintaining consistency within short sequences; globally,
without training, a flow-guided recurrent latent propagation module is
introduced to enhance overall video stability by propagating and fusing latents
across the entire sequence. Thanks to the diffusion paradigm, our model also
offers greater flexibility by allowing text prompts to guide texture creation
and adjustable noise levels to balance restoration and generation, enabling a
trade-off between fidelity and quality. Extensive experiments show that
Upscale-A-Video surpasses existing methods in both synthetic and real-world
benchmarks, as well as in AI-generated videos, showcasing impressive visual
realism and temporal consistency.
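As a rough illustration of the global mechanism, the flow-guided recurrent latent propagation can be pictured as warping the previous frame's latent forward with optical flow and fusing it with the current frame's latent during sampling. The sketch below is a minimal, hypothetical rendering of that idea in PyTorch; the function names, the simple weighted-average fusion, and the absence of occlusion handling are assumptions, not the paper's actual module.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(latent: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a latent map (B, C, H, W) with optical flow (B, 2, H, W)."""
    b, _, h, w = latent.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(latent)        # (2, H, W)
    coords = grid.unsqueeze(0) + flow                              # displaced coordinates
    # Normalize to [-1, 1] for grid_sample (x first, then y).
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)        # (B, H, W, 2)
    return F.grid_sample(latent, sample_grid, align_corners=True)

def propagate_latents(latents: torch.Tensor, flows: torch.Tensor, alpha: float = 0.5):
    """Recurrently propagate latents along the sequence and fuse them.

    latents: (T, C, H, W) per-frame diffusion latents at the current step
    flows:   (T-1, 2, H, W) optical flow from frame t-1 to frame t
    alpha:   fusion weight between the current latent and the warped history
    """
    out = [latents[0]]
    for t in range(1, latents.shape[0]):
        warped = warp_with_flow(out[-1].unsqueeze(0), flows[t - 1].unsqueeze(0))[0]
        out.append((1 - alpha) * latents[t] + alpha * warped)
    return torch.stack(out)
```

Because the module is described as training-free, such propagation can be applied at inference time without modifying the U-Net or VAE weights.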
Related papers
- ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation [83.62931466231898]
This paper presents ARLON, a framework that boosts diffusion Transformers with autoregressive models for long video generation.
A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens.
An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model.
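The "adaptive norm-based semantic injection" can be read as an AdaLN-style modulation in which the AR model's coarse visual tokens produce per-channel scale and shift for the DiT features. The sketch below is a generic illustration of that pattern under this reading; the class name, the mean pooling, and the layer shapes are assumptions rather than ARLON's actual module.

```python
import torch
import torch.nn as nn

class AdaptiveNormInjection(nn.Module):
    """AdaLN-style injection: coarse semantic tokens modulate DiT features."""

    def __init__(self, dit_dim: int, token_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dit_dim, elementwise_affine=False)
        # Map pooled coarse tokens to per-channel scale and shift.
        self.to_scale_shift = nn.Linear(token_dim, 2 * dit_dim)

    def forward(self, dit_feats: torch.Tensor, coarse_tokens: torch.Tensor):
        # dit_feats: (B, N, dit_dim), coarse_tokens: (B, M, token_dim)
        cond = coarse_tokens.mean(dim=1)                   # simple pooling of AR tokens
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(dit_feats) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```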
arXiv Detail & Related papers (2024-10-27T16:28:28Z) - VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide [48.22321420680046]
VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models.
It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process.
The proposed method brings about significant improvement in temporal consistency and image fidelity.
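One way to picture the interpolation step: at a given denoising step, the guiding (teacher) model's denoised estimate is blended into the sampling model's estimate before the sampler's update. The sketch below is illustrative only; the blend weight, its schedule, and the function names are assumptions, not VideoGuide's actual procedure.

```python
import torch

@torch.no_grad()
def guided_x0(x_t: torch.Tensor, t: torch.Tensor,
              sampler_model, guide_model, blend: float = 0.3):
    """Blend the guiding model's denoised estimate into the sampler's estimate.

    Both models are assumed to return an x0 (clean-sample) prediction for the
    noisy latent x_t at step t. The blended x0 then feeds whatever ODE/SDE
    update the sampler performs next.
    """
    x0_sample = sampler_model(x_t, t)   # estimate from the model being sampled
    x0_guide = guide_model(x_t, t)      # estimate from the pretrained guiding model
    return (1.0 - blend) * x0_sample + blend * x0_guide
```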
arXiv Detail & Related papers (2024-10-06T05:46:17Z) - Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach [29.753974393652356]
We propose a frame-aware video diffusion model (FVDM).
Our approach allows each frame to follow an independent noise schedule, enhancing the model's capacity to capture fine-grained temporal dependencies.
Our empirical evaluations show that FVDM outperforms state-of-the-art methods in video generation quality, while also excelling in extended tasks.
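The vectorized-timestep idea can be illustrated with the forward noising step: each frame receives its own independently sampled timestep rather than one scalar shared by the whole clip, using the standard DDPM formulation. The names and tensor layout below are illustrative assumptions.

```python
import torch

def add_noise_per_frame(frames: torch.Tensor, alphas_cumprod: torch.Tensor):
    """Noise each frame of a video with its own timestep (vectorized timesteps).

    frames:          (B, T, C, H, W) clean video latents
    alphas_cumprod:  (num_steps,) cumulative product of the noise schedule
    Returns noisy frames, the per-frame timesteps, and the injected noise.
    """
    b, t = frames.shape[:2]
    # One independent timestep per frame, rather than one scalar per video.
    timesteps = torch.randint(0, alphas_cumprod.shape[0], (b, t), device=frames.device)
    a_bar = alphas_cumprod[timesteps].view(b, t, 1, 1, 1)
    noise = torch.randn_like(frames)
    noisy = a_bar.sqrt() * frames + (1 - a_bar).sqrt() * noise
    return noisy, timesteps, noise
```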
arXiv Detail & Related papers (2024-10-04T05:47:39Z) - JVID: Joint Video-Image Diffusion for Visual-Quality and Temporal-Consistency in Video Generation [6.463753697299011]
We introduce the Joint Video-Image Diffusion model (JVID), a novel approach to generating high-quality temporally coherent videos.
Our results demonstrate quantitative and qualitative improvements in producing realistic and coherent videos.
arXiv Detail & Related papers (2024-09-21T13:59:50Z) - FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395]
We introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
This enhancement ensures a more consistent transformation of semantically similar content across frames.
Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
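The "explicit update of features" admits a simple reading: intermediate features are directly optimized against a temporal-consistency objective, for instance penalizing their deviation from flow-warped features of the preceding frame. The sketch below illustrates only that generic reading; the loss, the occlusion mask, and the optimizer settings are assumptions and not FRESCO's actual objective.

```python
import torch

def update_features(feats: torch.Tensor, warped_prev_feats: torch.Tensor,
                    occlusion_mask: torch.Tensor, steps: int = 10, lr: float = 0.1):
    """Gradient-based refinement of per-frame features toward temporal consistency.

    feats:             (T, C, H, W) features of the current frames
    warped_prev_feats: (T, C, H, W) previous-frame features warped by optical flow
    occlusion_mask:    (T, 1, H, W) 1 where the warp is valid, 0 where occluded
    """
    refined = feats.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([refined], lr=lr)
    for _ in range(steps):
        # Penalize disagreement with the flow-warped history only where valid.
        loss = (occlusion_mask * (refined - warped_prev_feats) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return refined.detach()
```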
arXiv Detail & Related papers (2024-03-19T17:59:18Z) - Lumiere: A Space-Time Diffusion Model for Video Generation [75.54967294846686]
We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once.
This is in contrast to existing video models, which synthesize distant keyframes followed by temporal super-resolution.
By deploying both spatial and (importantly) temporal down- and up-sampling, our model learns to directly generate a full-frame-rate, low-resolution video.
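The distinguishing ingredient summarized above is down- and up-sampling in time as well as space. A toy block showing what joint space-time down/up-sampling looks like is sketched below; it is illustrative only and not Lumiere's Space-Time U-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceTimeDownUp(nn.Module):
    """Toy space-time block: downsample in space and time, process, then upsample."""

    def __init__(self, channels: int):
        super().__init__()
        # A strided 3D conv halves the temporal and spatial resolution together.
        self.down = nn.Conv3d(channels, channels, kernel_size=3, stride=2, padding=1)
        self.body = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) full-frame-rate video features
        h = F.silu(self.down(x))
        h = F.silu(self.body(h))
        # Trilinear upsampling restores the original temporal and spatial size.
        return F.interpolate(h, size=x.shape[2:], mode="trilinear", align_corners=False)
```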
arXiv Detail & Related papers (2024-01-23T18:05:25Z) - Inflation with Diffusion: Efficient Temporal Adaptation for
Text-to-Video Super-Resolution [19.748048455806305]
We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach.
We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational costs and super-resolution quality.
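"Inflation" usually refers to copying pretrained 2D convolution weights into the temporal center of a 3D kernel, so that at initialization each frame is processed exactly as the image model would process it; the added temporal taps are then tuned. The sketch below shows that generic recipe, which may differ from this paper's exact procedure.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    """Inflate a pretrained 2D conv into a 3D conv that initially behaves identically.

    The 2D kernel is placed at the temporal center of the 3D kernel, so each
    frame is initially processed exactly as by the original image model.
    """
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        conv3d.weight.zero_()
        conv3d.weight[:, :, time_kernel // 2] = conv2d.weight  # copy 2D kernel to center
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```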
arXiv Detail & Related papers (2024-01-18T22:25:16Z) - Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution [15.197746480157651]
We propose an effective real-world VSR algorithm by leveraging the strength of pre-trained latent diffusion models.
We exploit the temporal dynamics in LR videos to guide the diffusion process by optimizing the latent sampling path with a motion-guided loss.
The proposed motion-guided latent diffusion based VSR algorithm achieves significantly better perceptual quality than state-of-the-art methods on real-world VSR benchmark datasets.
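The motion-guided loss can be pictured as a guidance term derived from the LR video's optical flow: at each sampling step, the denoised latent is compared against the flow-warped latent of the previous frame, and the gradient of that error nudges the latent. The step below is a hedged sketch; the loss, the guidance scale, and the assumption that the warped latent is supplied externally are illustrative choices, not the paper's algorithm.

```python
import torch
import torch.nn.functional as F

def motion_guided_step(z_t: torch.Tensor, z_warped_prev: torch.Tensor,
                       denoise_fn, t: torch.Tensor, guidance_scale: float = 0.1):
    """One sampling step with a motion-guided correction on the latent.

    z_t:           (B, C, H, W) current frame's noisy latent at step t
    z_warped_prev: (B, C, H, W) previous frame's predicted latent, warped to the
                   current frame with optical flow estimated from the LR video
    denoise_fn:    callable returning the x0 (clean-latent) prediction
    """
    z_t = z_t.detach().requires_grad_(True)
    z0_pred = denoise_fn(z_t, t)
    # Motion-guided loss: the denoised latent should agree with the flow-warped
    # latent from the previous frame.
    loss = F.mse_loss(z0_pred, z_warped_prev)
    grad = torch.autograd.grad(loss, z_t)[0]
    # Nudge the sampling path along the negative gradient of the loss.
    return (z_t - guidance_scale * grad).detach()
```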
arXiv Detail & Related papers (2023-12-01T14:40:07Z) - VideoGen: A Reference-Guided Latent Diffusion Approach for High
Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z)