InfVSR: Breaking Length Limits of Generic Video Super-Resolution
- URL: http://arxiv.org/abs/2510.00948v1
- Date: Wed, 01 Oct 2025 14:21:45 GMT
- Title: InfVSR: Breaking Length Limits of Generic Video Super-Resolution
- Authors: Ziqing Zhang, Kai Liu, Zheng Chen, Xi Li, Yucong Chen, Bingnan Duan, Linghe Kong, Yulun Zhang
- Abstract summary: InfVSR is an autoregressive-one-step-diffusion paradigm for long sequences. We distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to a 58x speed-up over existing methods.
- Score: 40.30527504651693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world videos often extend over thousands of frames. Existing video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency due to the heavy cost of multi-step denoising for full-length sequences; and (2) poor scalability hindered by temporal decomposition that causes artifacts and discontinuities. To break these limits, we propose InfVSR, which reformulates VSR as an autoregressive-one-step-diffusion paradigm. This enables streaming inference while fully leveraging pre-trained video diffusion priors. First, we adapt the pre-trained DiT into a causal structure, maintaining both local and global coherence via a rolling KV-cache and joint visual guidance. Second, we distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. Together, these designs enable efficient and scalable VSR for unbounded-length videos. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and further introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to a 58x speed-up over existing methods such as MGLD-VSR. Code will be available at https://github.com/Kai-Liu001/InfVSR.
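The chunk-wise streaming inference with a rolling KV-cache described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `one_step_upscale`, the chunk size, and the cache window are all hypothetical stand-ins (the real model would run a causal DiT with a single denoising step per chunk).

```python
from collections import deque

def one_step_upscale(chunk, kv_cache):
    """Placeholder for the causal-DiT one-step denoiser.

    Here we simply double each frame value; a real model would attend
    over the cached keys/values of recent chunks for temporal context.
    """
    return [f * 2 for f in chunk]

def stream_vsr(frames, chunk_size=4, cache_window=2):
    """Process an unbounded frame stream chunk by chunk.

    A rolling KV-cache keeps only the most recent `cache_window` chunks,
    so memory stays bounded no matter how long the video is.
    """
    kv_cache = deque(maxlen=cache_window)  # oldest entries evicted automatically
    out = []
    for i in range(0, len(frames), chunk_size):
        chunk = frames[i:i + chunk_size]
        out.extend(one_step_upscale(chunk, list(kv_cache)))
        kv_cache.append(chunk)  # cache this chunk's "keys/values" for later chunks
    return out
```

The key point of the sketch is that each chunk is denoised exactly once while conditioning only on a bounded window of past context, which is what makes streaming over thousands of frames feasible.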
Related papers
- Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers [95.68243351895107]
We propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per…
arXiv Detail & Related papers (2026-01-21T12:58:52Z) - BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation [44.45173635133032]
BlockVid is a novel block diffusion framework equipped with a semantic-aware sparse KV cache. LV-Bench is a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence.
arXiv Detail & Related papers (2025-11-28T08:25:59Z) - FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution [61.284842030283464]
FlashVSR is the first diffusion-based one-step streaming framework towards real-time VSR. It runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU. It scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to a 12x speedup over prior one-step diffusion VSR models.
arXiv Detail & Related papers (2025-10-14T17:25:54Z) - OS-DiffVSR: Towards One-step Latent Diffusion Model for High-detailed Real-world Video Super-Resolution [11.859297492802456]
We propose a One-Step Diffusion model for real-world Video Super-Resolution, namely OS-DiffVSR. Specifically, we devise a novel adjacent-frame adversarial training paradigm, which can significantly improve the quality of synthetic videos.
arXiv Detail & Related papers (2025-09-20T03:04:41Z) - FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution [68.77813885751308]
State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information. We propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data. Our approach introduces a compression-driven dimensionality reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames.
arXiv Detail & Related papers (2025-06-13T07:59:52Z) - UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space [46.43409853027655]
Diffusion models have shown great potential in generating realistic image detail. Adapting these models to video super-resolution (VSR) remains challenging due to their inherent randomness and lack of temporal modeling. We propose UltraVSR, a novel framework that enables ultra-realistic and temporally-coherent VSR through an efficient one-step diffusion space.
arXiv Detail & Related papers (2025-05-26T13:19:27Z) - VORTA: Efficient Video Diffusion via Routing Sparse Attention [54.84294780326206]
VORTA is an acceleration framework with two novel components. It achieves an end-to-end speedup of $1.76\times$ without loss of quality on VBench. It can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching up to a $14.41\times$ speedup with negligible performance degradation.
arXiv Detail & Related papers (2025-05-24T17:46:47Z) - DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution [43.83739935393097]
We propose DOVE, an efficient one-step diffusion model for real-world video super-resolution. DOVE is obtained by fine-tuning a pretrained video diffusion model (*i.e.*, CogVideoX). Experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods.
arXiv Detail & Related papers (2025-05-22T05:16:45Z) - Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
The new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT ($3\times$ inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z) - Fast Online Video Super-Resolution with Deformable Attention Pyramid [172.16491820970646]
Video super-resolution (VSR) has many applications that pose strict causal, real-time, and latency constraints, including video streaming and TV.
We propose a recurrent VSR architecture based on a deformable attention pyramid (DAP)
arXiv Detail & Related papers (2022-02-03T17:49:04Z) - Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution [95.26202278535543]
A simple solution is to split it into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR). However, temporal synthesis and spatial super-resolution are intra-related in this task.
We propose a one-stage space-time video super-resolution framework, which directly synthesizes an HR slow-motion video from an LFR, LR video.
arXiv Detail & Related papers (2020-02-26T16:59:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.