SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration
- URL: http://arxiv.org/abs/2501.01320v3
- Date: Tue, 04 Feb 2025 18:29:36 GMT
- Title: SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration
- Authors: Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Fei Xiao, Chen Change Loy, Lu Jiang
- Abstract summary: SeedVR is a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution.
It achieves highly-competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos.
- Score: 73.70209718408641
- Abstract: Video restoration poses non-trivial challenges in maintaining fidelity while recovering temporally consistent details from unknown degradations in the wild. Despite recent advances in diffusion-based restoration, these methods often face limitations in generation capability and sampling efficiency. In this work, we present SeedVR, a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR further supports variable-sized windows near the boundary of both spatial and temporal dimensions, overcoming the resolution constraints of traditional window attention. Equipped with contemporary practices, including causal video autoencoder, mixed image and video training, and progressive training, SeedVR achieves highly-competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR's superiority over existing methods for generic video restoration.
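The window mechanism described above is concrete enough to sketch. Below is a minimal, hedged PyTorch sketch of window attention in which boundary windows are simply clipped to whatever tokens remain, so sequence length and resolution need not be multiples of the window size. The function name, shapes, single-head attention, and the omission of the attention mask that a real shifted-window implementation applies after the cyclic shift are all simplifying assumptions, not SeedVR's actual code.

```python
# Hedged sketch: window attention over (T, H, W, C) video tokens with
# variable-sized windows at spatio-temporal boundaries. Illustrative
# only; not SeedVR's implementation.
import torch
import torch.nn.functional as F

def window_attention(x, window=(4, 8, 8), shift=(0, 0, 0)):
    T, H, W, C = x.shape
    # Cyclic shift so window borders move between layers. A real
    # implementation also masks tokens that wrap around; omitted here.
    x = torch.roll(x, shifts=[-s for s in shift], dims=(0, 1, 2))
    out = torch.empty_like(x)
    for t0 in range(0, T, window[0]):
        for h0 in range(0, H, window[1]):
            for w0 in range(0, W, window[2]):
                # Boundary windows are clipped, hence variable-sized:
                # no padding, so T, H, W need not divide evenly.
                blk = x[t0:t0 + window[0], h0:h0 + window[1], w0:w0 + window[2]]
                tokens = blk.reshape(-1, C)  # (N, C), N varies per window
                attn = F.softmax(tokens @ tokens.T / C ** 0.5, dim=-1)
                out[t0:t0 + window[0], h0:h0 + window[1], w0:w0 + window[2]] = \
                    (attn @ tokens).reshape(blk.shape)
    return torch.roll(out, shifts=list(shift), dims=(0, 1, 2))  # undo shift

# Alternate plain and half-window-shifted layers so information crosses
# window borders over depth; 10 frames at 36x60 is not a multiple of
# the 4x8x8 window, exercising the variable-sized boundary windows.
video = torch.randn(10, 36, 60, 16)
y = window_attention(video)
y = window_attention(y, shift=(2, 4, 4))
```

Looping over windows keeps the sketch readable; an efficient implementation would batch equal-sized windows and add multi-head projections.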
Related papers
- DiffVSR: Enhancing Real-World Video Super-Resolution with Diffusion Models for Advanced Visual Quality and Temporal Consistency [25.756755602342942]
We present DiffVSR, a diffusion-based framework for real-world video super-resolution.
For intra-sequence coherence, we develop a multi-scale temporal attention module and temporal-enhanced VAE decoder.
We propose a progressive learning strategy that transitions from simple to complex degradations, enabling robust optimization.
arXiv Detail & Related papers (2025-01-17T10:53:03Z)
- DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models [9.145545884814327]
This paper introduces a method for zero-shot video restoration using pre-trained image restoration diffusion models.
We show that our method achieves top performance in zero-shot video restoration.
Our technique works with any 2D restoration diffusion model, offering a versatile and powerful tool for video enhancement tasks without extensive retraining.
arXiv Detail & Related papers (2024-07-01T17:59:12Z)
- ViStripformer: A Token-Efficient Transformer for Versatile Video Restoration [42.356013390749204]
ViStripformer is an effective and efficient transformer architecture with much lower memory usage than the vanilla transformer.
It decomposes video frames into strip-shaped features along the horizontal and vertical directions for Intra-SA and Inter-SA, addressing degradation patterns of various orientations and magnitudes (see the sketch below).
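As a rough illustration of the strip-wise decomposition (a minimal sketch under assumed names and shapes; ViStripformer's actual Intra-SA uses multi-head projections, and Inter-SA additionally attends across frames):

```python
# Hedged sketch: self-attention restricted to horizontal and vertical
# strips of a frame, in the spirit of intra-frame strip attention.
# Illustrative names and shapes; not ViStripformer's code.
import torch
import torch.nn.functional as F

def strip_attention(x):
    """x: (B, H, W, C). Attention runs along each row (horizontal strip)
    and each column (vertical strip); cost is O(HW(H + W)) rather than
    O((HW)^2) for full spatial attention."""
    C = x.shape[-1]
    scale = C ** -0.5

    def sa(tokens):  # softmax attention over the second-to-last dim
        attn = F.softmax(tokens @ tokens.transpose(-2, -1) * scale, dim=-1)
        return attn @ tokens

    horiz = sa(x)                                 # within each row of W tokens
    vert = sa(x.transpose(1, 2)).transpose(1, 2)  # within each column of H tokens
    return horiz + vert

out = strip_attention(torch.randn(2, 64, 48, 32))  # shape-preserving
```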
arXiv Detail & Related papers (2023-12-22T08:05:38Z)
- Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE decoder, maintaining consistency within short sequences; globally, a training-free flow-guided recurrent latent propagation module enhances stability across long sequences.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z)
- ConVRT: Consistent Video Restoration Through Turbulence with Test-time Optimization of Neural Video Representations [13.38405890753946]
We introduce a self-supervised method, Consistent Video Restoration through Turbulence (ConVRT).
ConVRT is a test-time optimization method featuring a neural video representation designed to enhance temporal consistency in restoration.
A key innovation of ConVRT is the integration of a pretrained vision-language model (CLIP) for semantic-oriented supervision.
arXiv Detail & Related papers (2023-12-07T20:19:48Z)
- Cross-Consistent Deep Unfolding Network for Adaptive All-In-One Video Restoration [78.14941737723501]
We propose a Cross-consistent Deep Unfolding Network (CDUN) for all-in-one video restoration.
By orchestrating two cascading procedures, CDUN achieves adaptive processing for diverse degradations.
In addition, we introduce a window-based inter-frame fusion strategy to utilize information from more adjacent frames.
arXiv Detail & Related papers (2023-09-04T14:18:00Z)
- VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution [75.79379734567604]
We show that Video Implicit Neural Representation (VideoINR) can be decoded to videos of arbitrary spatial resolution and frame rate.
We show that VideoINR achieves competitive performance with state-of-the-art STVSR methods on common up-sampling scales.
arXiv Detail & Related papers (2022-06-09T17:45:49Z)
- On the Generalization of BasicVSR++ to Video Deblurring and Denoising [98.99165593274304]
We extend BasicVSR++ to a generic framework for video restoration tasks.
In tasks where the input and output have the same spatial size, the input resolution is reduced by strided convolutions to maintain efficiency (see the sketch after this entry).
With only minimal changes from BasicVSR++, the proposed framework achieves compelling performance with great efficiency in various video restoration tasks.
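A hedged sketch of that efficiency trick, assuming a residual restorer with a 4x strided-conv downsampling stage and a pixel-shuffle upsampling stage (the class name and layer choices are illustrative, not the paper's architecture):

```python
# Hedged sketch: for same-resolution tasks (deblurring, denoising),
# downsample early with strided convolutions so the heavy body runs
# at 1/4 resolution, then restore the size with pixel shuffle.
import torch
import torch.nn as nn

class SameSizeRestorer(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # Two stride-2 convs: 4x spatial reduction before the main body.
        self.down = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Stand-in for the recurrent propagation body of BasicVSR++.
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Conv to 3*4*4 channels, then PixelShuffle(4) restores H x W.
        self.up = nn.Sequential(nn.Conv2d(ch, 3 * 16, 3, padding=1),
                                nn.PixelShuffle(4))

    def forward(self, x):  # x: (B, 3, H, W) with H, W divisible by 4
        return x + self.up(self.body(self.down(x)))  # residual output

y = SameSizeRestorer()(torch.randn(1, 3, 64, 64))  # -> (1, 3, 64, 64)
```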
arXiv Detail & Related papers (2022-04-11T17:59:56Z)
- VRT: A Video Restoration Transformer [126.79589717404863]
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames.
We propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities.
arXiv Detail & Related papers (2022-01-28T17:54:43Z)
- Evaluating Foveated Video Quality Using Entropic Differencing [1.5877673959068452]
We propose a full-reference (FR) foveated image quality assessment algorithm, foveated entropic differencing (FED), which employs the natural scene statistics of bandpass responses.
We evaluate the proposed algorithm by measuring the correlations of the predictions that FED makes against human judgements.
The proposed algorithm achieves state-of-the-art performance compared with other existing full-reference algorithms.
arXiv Detail & Related papers (2021-06-12T16:29:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.