Related papers: Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment

Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment

URL: http://arxiv.org/abs/2503.03355v4
Date: Thu, 08 May 2025 15:38:04 GMT
Title: Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment
Authors: Zhihao Zhan, Wang Pang, Xiang Zhu, Yechao Bai,
Abstract summary: We argue that a powerful model, which learns the physics of the real world, can easily handle various kinds of motion patterns as prior knowledge.<n>A single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training.<n> Empirical results on synthetic and real-world datasets illustrate the feasibility of diffusion-based, alignment-free video super-resolution.
Score: 3.052019331122618
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this work, we rethink the approach to video super-resolution by introducing a method based on the Diffusion Posterior Sampling framework, combined with an unconditional video diffusion transformer operating in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model, which learns the physics of the real world, can easily handle various kinds of motion patterns as prior knowledge, thus eliminating the need for explicit estimation of optical flows or motion parameters for pixel alignment. Furthermore, a single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training. Empirical results on synthetic and real-world datasets illustrate the feasibility of diffusion-based, alignment-free video super-resolution.

Related papers

FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation [51.110607281391154]
FlowMo is a training-free guidance method for enhancing motion coherence in text-to-video models.<n>It estimates motion coherence by measuring the patch-wise variance across the temporal dimension and guides the model to reduce this variance dynamically during sampling.
arXiv Detail & Related papers (2025-06-01T19:55:33Z)
RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism [73.38167494118746]
We propose a framework to improve the realism of motion in generated videos. We advocate for the incorporation of a retrieval mechanism during the generation phase. Our pipeline is designed to apply to any text-to-video diffusion model.
arXiv Detail & Related papers (2025-04-09T08:14:05Z)
Image Motion Blur Removal in the Temporal Dimension with Video Diffusion Models [3.052019331122618]
We propose a novel single-image deblurring approach that treats motion blur as a temporal averaging phenomenon.<n>Our core innovation lies in leveraging a pre-trained video diffusion transformer model to capture diverse motion dynamics.<n> Empirical results on synthetic and real-world datasets demonstrate that our method outperforms existing techniques in deblurring complex motion blur scenarios.
arXiv Detail & Related papers (2025-01-22T03:01:54Z)
Diffusing Differentiable Representations [60.72992910766525]
We introduce a novel, training-free method for sampling differentiable representations (diffreps) using pretrained diffusion models.<n>We identify an implicit constraint on the samples induced by the diffrep and demonstrate that addressing this constraint significantly improves the consistency and detail of the generated objects.
arXiv Detail & Related papers (2024-12-09T20:42:58Z)
Optical-Flow Guided Prompt Optimization for Coherent Video Generation [51.430833518070145]
We propose a framework called MotionPrompt that guides the video generation process via optical flow. We optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content.
arXiv Detail & Related papers (2024-11-23T12:26:52Z)
Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method [60.88467353578118]
We show that a fixed-point-inspired iterative approach to invert real-world images does not achieve convergence, instead oscillating between distinct clusters. We introduce a simple and fast distribution transfer technique that facilitates image enhancement, stroke-based recoloring, as well as visual prompt-guided image editing.
arXiv Detail & Related papers (2024-11-17T17:45:37Z)
Noise Crystallization and Liquid Noise: Zero-shot Video Generation using Image Diffusion Models [6.408114351192012]
Video models require extensive training and computational resources, leading to high costs and large environmental impacts. This paper introduces a novel approach to video generation by augmenting image diffusion models to create sequential animation frames while maintaining fine detail.
arXiv Detail & Related papers (2024-10-05T12:53:05Z)
ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free video method for generative video models in a plug-and-play manner. We transform a video model into a self-cascaded video diffusion model with the designed hidden state correction modules. Our training-free method is even comparable to trained models supported by huge compute resources and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z)
Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution [151.1255837803585]
We propose a novel approach, pursuing Spatial Adaptation and Temporal Coherence (SATeCo) for video super-resolution. SATeCo pivots on learning spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction. Experiments conducted on the REDS4 and Vid4 datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-03-25T17:59:26Z)
Lumiere: A Space-Time Diffusion Model for Video Generation [75.54967294846686]
We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once. This is in contrast to existing video models which synthesize distants followed by temporal super-resolution. By deploying both spatial and (importantly) temporal down- and up-sampling, our model learns to directly generate a full-frame-rate, low-resolution video.
arXiv Detail & Related papers (2024-01-23T18:05:25Z)
Video Probabilistic Diffusion Models in Projected Latent Space [75.4253202574722]
We propose a novel generative model for videos, coined projected latent video diffusion models (PVDM) PVDM learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources.
arXiv Detail & Related papers (2023-02-15T14:22:34Z)
VIDM: Video Implicit Diffusion Models [75.90225524502759]
Diffusion models have emerged as a powerful generative method for synthesizing high-quality and diverse set of images. We propose a video generation method based on diffusion models, where the effects of motion are modeled in an implicit condition. We improve the quality of the generated videos by proposing multiple strategies such as sampling space truncation, robustness penalty, and positional group normalization.
arXiv Detail & Related papers (2022-12-01T02:58:46Z)
Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models. We find Imagen Video capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.