Related papers: DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

URL: http://arxiv.org/abs/2505.16239v1
Date: Thu, 22 May 2025 05:16:45 GMT
Title: DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution
Authors: Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, Yulun Zhang
Abstract summary: We propose DOVE, an efficient one-step diffusion model for real-world video super-resolution. DOVE is obtained by fine-tuning a pretrained video diffusion model (*i.e.*, CogVideoX). Experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods.
Score: 43.83739935393097
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require make inference extremely slow. Sampling acceleration techniques, particularly single-step sampling, provide a potential solution. Nonetheless, achieving one-step sampling in VSR remains challenging due to the high training overhead on video data and stringent fidelity demands. To tackle these issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (*i.e.*, CogVideoX). To effectively train DOVE, we introduce the latent-pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a **28$\times$** speed-up over existing methods such as MGLD-VSR. Code is available at: https://github.com/zhengchen1999/DOVE.

Related papers

Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution [36.32266529540775]
One-step networks like SeedVR2, DOVE, and DLoRAL are heavy with billions of parameters and multi-second latency. Recent adversarial diffusion compression (ADC) offers a promising path via pruning and distilling these models into a compact AdcSR network. We propose an improved ADC method for Real-VSR that balances spatial details and temporal consistency.
arXiv Detail & Related papers (2026-02-28T04:30:54Z)
FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution [61.284842030283464]
FlashVSR is the first diffusion-based one-step streaming framework towards real-time VSR. It runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU. It scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models.
arXiv Detail & Related papers (2025-10-14T17:25:54Z)
Real-Time Motion-Controllable Autoregressive Video Diffusion [79.32730467857535]
We propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout learning mechanism and accelerates training by selectively denoising steps.
arXiv Detail & Related papers (2025-10-09T12:17:11Z)
Asymmetric VAE for One-Step Video Super-Resolution Acceleration [63.419142632861345]
We propose FastVSR, which achieves substantial reductions in computational cost by implementing a high compression VAE. FastVSR achieves speedups of 111.9 times compared to multi-step models and 3.92 times compared to existing one-step models.
arXiv Detail & Related papers (2025-09-29T00:36:14Z)
OS-DiffVSR: Towards One-step Latent Diffusion Model for High-detailed Real-world Video Super-Resolution [11.859297492802456]
We propose One-Step Diffusion model for real-world Video Super-Resolution, namely OS-DiffVSR. Specifically, we devise a novel adjacent frame adversarial training paradigm, which can significantly improve the quality of synthetic videos.
arXiv Detail & Related papers (2025-09-20T03:04:41Z)
RealisVSR: Detail-enhanced Diffusion for Real-World 4K Video Super-Resolution [42.96414692062782]
RealisVSR is a high-frequency detail-enhanced video diffusion model with three core innovations. Our method requires only 5-25% of the training data volume compared to existing approaches.
arXiv Detail & Related papers (2025-07-25T10:18:33Z)
SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution [55.14432034345353]
We study key design principles for latter cascaded video super-resolution models, which are underexplored currently. First, we propose two strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies, (2) noise augmentation effects on low-resolution (LR) inputs.
arXiv Detail & Related papers (2025-06-24T17:57:26Z)
SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation [26.045123066151838]
SRDiffusion is a novel framework that leverages collaboration between large and small models to reduce inference cost. Our method is introduced as a new direction to existing acceleration strategies, offering a practical solution for scalable video generation.
arXiv Detail & Related papers (2025-05-25T13:58:52Z)
One Diffusion Step to Real-World Super-Resolution via Flow Trajectory Distillation [60.54811860967658]
FluxSR is a novel one-step diffusion Real-ISR based on flow matching models. First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR. Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss.
arXiv Detail & Related papers (2025-02-04T04:11:29Z)
SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models. Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution [15.197746480157651]
We propose an effective real-world VSR algorithm by leveraging the strength of pre-trained latent diffusion models. We exploit the temporal dynamics in LR videos to guide the diffusion process by optimizing the latent sampling path with a motion-guided loss. The proposed motion-guided latent diffusion based VSR algorithm achieves significantly better perceptual quality than state-of-the-arts on real-world VSR benchmark datasets.
arXiv Detail & Related papers (2023-12-01T14:40:07Z)
Benchmark Dataset and Effective Inter-Frame Alignment for Real-World Video Super-Resolution [65.20905703823965]
Video super-resolution (VSR) aiming to reconstruct a high-resolution (HR) video from its low-resolution (LR) counterpart has made tremendous progress in recent years. It remains challenging to deploy existing VSR methods to real-world data with complex degradations. EAVSR takes the proposed multi-layer adaptive spatial transform network (MultiAdaSTN) to refine the offsets provided by the pre-trained optical flow estimation network.
arXiv Detail & Related papers (2022-12-10T17:41:46Z)
Investigating Tradeoffs in Real-World Video Super-Resolution [90.81396836308085]
Real-world video super-resolution (VSR) models are often trained with diverse degradations to improve generalizability. To alleviate the first tradeoff, we propose a degradation scheme that reduces up to 40% of training time without sacrificing performance. To facilitate fair comparisons, we propose the new VideoLQ dataset, which contains a large variety of real-world low-quality video sequences.
arXiv Detail & Related papers (2021-11-24T18:58:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.