FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution
- URL: http://arxiv.org/abs/2510.12747v1
- Date: Tue, 14 Oct 2025 17:25:54 GMT
- Title: FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution
- Authors: Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, Tianfan Xue
- Abstract summary: FlashVSR is the first diffusion-based one-step streaming framework towards real-time VSR. It runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU, scales reliably to ultra-high resolutions, and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train-test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based VSR.
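The locality-constrained sparse attention in (ii) is described only at a high level here. A minimal sketch of one plausible interpretation, where each query token on the latent grid attends only to key tokens within a fixed spatial window so attention cost grows linearly rather than quadratically with resolution; the window size, flattening order, and function name are assumptions, not FlashVSR's actual configuration:

```python
import numpy as np

def local_attention_mask(h_tokens, w_tokens, window=8):
    """Boolean mask: entry (i, j) is True if query token i may attend
    to key token j.

    Tokens live on an (h_tokens x w_tokens) latent grid; a pair is
    allowed only when the tokens are within `window` rows AND `window`
    cols of each other. All names and the window size are illustrative.
    """
    ys, xs = np.meshgrid(np.arange(h_tokens), np.arange(w_tokens), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)          # (N, 2)
    diff = np.abs(coords[:, None, :] - coords[None, :, :])        # (N, N, 2)
    return (diff <= window).all(axis=-1)                          # (N, N)

mask = local_attention_mask(4, 4, window=1)
# Token (0,0) sees its 3x3 neighborhood clipped to the grid: 4 tokens.
print(mask[0].sum())
```

Such a mask can be passed to a standard attention implementation (e.g. as `attn_mask`), and because the allowed window is fixed in token units, the same mask construction applies at test-time resolutions larger than those seen in training, which is one way to read the claimed train-test resolution-gap bridging.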
Related papers
- Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion [10.847237180991948]
Stream-DiffVSR is a causally conditioned diffusion framework for efficient online VSR. It processes 720p frames in 0.328 seconds on a GTX 4090 GPU, boosting perceptual quality (LPIPS +0.095) while reducing latency by over 130x.
arXiv Detail & Related papers (2025-12-29T18:59:57Z) - InfVSR: Breaking Length Limits of Generic Video Super-Resolution [40.30527504651693]
InfVSR is an autoregressive one-step-diffusion paradigm for long sequences. It distills the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. The method pushes the frontier of long-form VSR, achieving state-of-the-art quality with enhanced semantic consistency and up to 58x speed-up over existing methods.
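The chunk-wise autoregressive processing described above can be sketched as: split the frame sequence into fixed-size chunks and condition each one-step upscaling call on the previous chunk's output. The chunk size, the `upscale_one_step` interface, and the carry-forward conditioning are all assumptions for illustration, not InfVSR's actual design:

```python
import numpy as np

def stream_vsr(frames, upscale_one_step, chunk=8):
    """Process a long low-res sequence chunk by chunk.

    `upscale_one_step(chunk_frames, context)` stands in for a one-step
    diffusion upscaler conditioned on the previous chunk's result, so
    memory stays bounded regardless of sequence length. Both the name
    and the chunk size are illustrative assumptions.
    """
    outputs, context = [], None
    for start in range(0, len(frames), chunk):
        hr = upscale_one_step(frames[start:start + chunk], context)
        outputs.append(hr)
        context = hr  # carry the last result forward as conditioning
    return np.concatenate(outputs)

# Dummy stand-in upscaler: nearest-neighbour 2x upscale, ignoring context.
fake = lambda lr, ctx: lr.repeat(2, axis=1).repeat(2, axis=2)
out = stream_vsr(np.zeros((20, 4, 4)), fake, chunk=8)
print(out.shape)  # (20, 8, 8)
```

The point of the sketch is the control flow: each chunk is denoised in a single step and only the previous output is kept, which is what lets such methods run on unbounded-length video.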
arXiv Detail & Related papers (2025-10-01T14:21:45Z) - Asymmetric VAE for One-Step Video Super-Resolution Acceleration [63.419142632861345]
We propose FastVSR, which substantially reduces computational cost by using a high-compression VAE. FastVSR achieves speedups of 111.9x over multi-step models and 3.92x over existing one-step models.
arXiv Detail & Related papers (2025-09-29T00:36:14Z) - OS-DiffVSR: Towards One-step Latent Diffusion Model for High-detailed Real-world Video Super-Resolution [11.859297492802456]
We propose a One-Step Diffusion model for real-world Video Super-Resolution, namely OS-DiffVSR. Specifically, we devise a novel adjacent-frame adversarial training paradigm, which significantly improves the quality of synthetic videos.
arXiv Detail & Related papers (2025-09-20T03:04:41Z) - TurboVSR: Fantastic Video Upscalers and Where to Find Them [33.83721799307721]
Diffusion-based generative models have demonstrated exceptional promise in the video super-resolution (VSR) task. We present TurboVSR, an ultra-efficient diffusion-based video super-resolution model. TurboVSR performs on par with state-of-the-art VSR methods while being 100+ times faster, taking only 7 seconds to process a 2-second 1080p video.
arXiv Detail & Related papers (2025-06-30T08:24:13Z) - FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution [68.77813885751308]
State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information. We propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data. Our approach introduces a compression-driven dimensionality-reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames.
arXiv Detail & Related papers (2025-06-13T07:59:52Z) - DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution [43.83739935393097]
We propose DOVE, an efficient one-step diffusion model for real-world video super-resolution. DOVE is obtained by fine-tuning a pretrained video diffusion model (i.e., CogVideoX). Experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods.
arXiv Detail & Related papers (2025-05-22T05:16:45Z) - Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
The new tokenizer, a Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT (3x inference speedup) still performs comparably with top baselines.
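The training objective described above (MSE diffusion loss plus a KL term and an LPIPS perceptual loss) can be sketched as a weighted sum. The weights, the diagonal-Gaussian KL form, and the `lpips_fn` interface are placeholder assumptions, not the paper's actual values:

```python
import numpy as np

def tokenizer_loss(x0_pred, x0, mu, logvar, lpips_fn,
                   w_kl=1e-6, w_lpips=0.1):
    """Toy version of a tokenizer objective in the style described:
    plain MSE reconstruction loss + KL regularizer on a diagonal
    Gaussian latent + perceptual (LPIPS-like) term. Weights are
    illustrative guesses."""
    mse = np.mean((x0_pred - x0) ** 2)
    # KL( N(mu, exp(logvar)) || N(0, 1) ), averaged over latent elements
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    perceptual = float(lpips_fn(x0_pred, x0))
    return mse + w_kl * kl + w_lpips * perceptual

x = np.zeros((1, 3, 8, 8))
# Perfect reconstruction + latent matching the prior: every term is zero.
loss = tokenizer_loss(x, x, np.zeros(4), np.zeros(4), lambda a, b: 0.0)
print(loss)  # 0.0
```

In practice `lpips_fn` would be a pretrained perceptual network (e.g. the `lpips` package); a constant stand-in is used here so the sketch stays self-contained.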
arXiv Detail & Related papers (2025-03-05T17:59:19Z) - Investigating Tradeoffs in Real-World Video Super-Resolution [90.81396836308085]
Real-world video super-resolution (VSR) models are often trained with diverse degradations to improve generalizability.
To alleviate the first tradeoff, we propose a degradation scheme that reduces up to 40% of training time without sacrificing performance.
To facilitate fair comparisons, we propose the new VideoLQ dataset, which contains a large variety of real-world low-quality video sequences.
arXiv Detail & Related papers (2021-11-24T18:58:21Z) - Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution [95.26202278535543]
A simple solution is to split it into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR). However, temporal synthesis and spatial super-resolution are intra-related in this task.
We propose a one-stage space-time video super-resolution framework, which directly synthesizes an HR slow-motion video from an LFR, LR video.
arXiv Detail & Related papers (2020-02-26T16:59:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.