Towards High-Quality and Efficient Video Super-Resolution via
Spatial-Temporal Data Overfitting
- URL: http://arxiv.org/abs/2303.08331v2
- Date: Sun, 18 Jun 2023 15:29:37 GMT
- Title: Towards High-Quality and Efficient Video Super-Resolution via
Spatial-Temporal Data Overfitting
- Authors: Gen Li, Jie Ji, Minghai Qin, Wei Niu, Bin Ren, Fatemeh Afghah, Linke
Guo, Xiaolong Ma
- Abstract summary: Deep convolutional neural networks (DNNs) are widely used in various fields of computer vision.
We propose a novel method for high-quality and efficient video resolution upscaling tasks.
We deploy our models on an off-the-shelf mobile phone, and experimental results show that our method achieves real-time video super-resolution with high video quality.
- Score: 27.302681897961588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As deep convolutional neural networks (DNNs) are widely used in various
fields of computer vision, leveraging the overfitting ability of DNNs to
achieve video resolution upscaling has become a new trend in modern video
delivery systems. By dividing videos into chunks and overfitting each chunk with
a super-resolution model, the server encodes videos before transmitting them to
the clients, thus achieving better video quality and transmission efficiency.
However, a large number of chunks are expected to ensure good overfitting
quality, which substantially increases the storage and consumes more bandwidth
resources for data transmission. On the other hand, decreasing the number of
chunks through training optimization techniques usually requires high model
capacity, which significantly slows down execution speed. To reconcile these competing demands, we
propose a novel method for high-quality and efficient video resolution
upscaling tasks, which leverages the spatial-temporal information to accurately
divide a video into chunks, thus keeping both the number of chunks and the
model size to a minimum. Additionally, we advance our method into a single
overfitting model by a data-aware joint training technique, which further
reduces the storage requirement with negligible quality drop. We deploy our
models on an off-the-shelf mobile phone, and experimental results show that our
method achieves real-time video super-resolution with high video quality.
Compared with the state-of-the-art, our method achieves a 28 fps streaming speed
with 41.6 dB PSNR, which is 14× faster and 2.29 dB better in the live video
resolution upscaling task. Code is available at
https://github.com/coulsonlee/STDO-CVPR2023.git
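The core idea above (spatial-temporal chunking plus per-chunk overfitting) can be illustrated with a short sketch. The code below is not the authors' implementation (see the repository linked above): the tiny model, the bicubic-error difficulty proxy, and all helper names are illustrative assumptions. It scores LR/HR patch pairs by how hard they are to upscale, groups patches of similar difficulty into chunks, and overfits one small SR model per chunk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySR(nn.Module):
    """A deliberately small SR network that is cheap to overfit to one chunk."""
    def __init__(self, scale=4, ch=32):
        super().__init__()
        self.scale = scale
        self.body = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):  # x: (B, 3, h, w) low-resolution patch
        up = F.interpolate(x, scale_factor=self.scale, mode="bicubic",
                           align_corners=False)
        return self.body(x) + up  # learn a residual on top of bicubic upscaling

def difficulty(lr_patch, hr_patch, scale=4):
    """Proxy for spatial-temporal difficulty: bicubic upscaling error of a patch."""
    up = F.interpolate(lr_patch, scale_factor=scale, mode="bicubic",
                       align_corners=False)
    return F.mse_loss(up, hr_patch).item()

def chunk_by_difficulty(pairs, num_chunks=4):
    """Group (LR, HR) patch pairs of similar difficulty into chunks."""
    ranked = sorted(pairs, key=lambda p: difficulty(*p))
    n = len(ranked)
    return [ranked[i * n // num_chunks:(i + 1) * n // num_chunks]
            for i in range(num_chunks)]

def overfit_chunk(chunk, steps=300, lr=1e-3):
    """Overfit one tiny model to a single chunk; the model is later streamed
    to the client together with the corresponding LR chunk."""
    model = TinySR()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        for lr_p, hr_p in chunk:
            opt.zero_grad()
            F.l1_loss(model(lr_p), hr_p).backward()
            opt.step()
    return model

# patch_pairs: list of ((1, 3, h, w) LR tensor, (1, 3, 4h, 4w) HR tensor)
# chunks = chunk_by_difficulty(patch_pairs)
# models = [overfit_chunk(c) for c in chunks]   # one small model per chunk
```

In the delivery setting described in the abstract, the server would then encode each LR chunk together with its overfit model and stream both to the client, which runs the small model for real-time upscaling.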
Related papers
- EPS: Efficient Patch Sampling for Video Overfitting in Deep Super-Resolution Model Training [15.684865589513597]
We propose an efficient patch sampling method named EPS for video SR network overfitting.
Our method reduces the number of training patches to 4%-25% of the total, depending on the resolution and the number of clusters.
Compared to the state-of-the-art patch sampling method, EMT, our approach achieves an 83% decrease in overall run time.
arXiv Detail & Related papers (2024-11-25T12:01:57Z)
- Adaptive Caching for Faster Video Generation with Diffusion Transformers [52.73348147077075]
Diffusion Transformers (DiTs) rely on larger models and heavier attention mechanisms, resulting in slower inference speeds.
We introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache).
We also introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, controlling the compute allocation based on motion content.
arXiv Detail & Related papers (2024-11-04T18:59:44Z)
- Data Overfitting for On-Device Super-Resolution with Dynamic Algorithm and Compiler Co-Design [18.57172631588624]
We propose a Dynamic Deep neural network assisted by a Content-Aware data processing pipeline to reduce the number of models down to one.
Our method achieves better PSNR and real-time performance (33 FPS) on an off-the-shelf mobile phone.
arXiv Detail & Related papers (2024-07-03T05:17:26Z)
- Hierarchical Patch Diffusion Models for High-Resolution Video Generation [50.42746357450949]
We develop deep context fusion, which propagates context information from low-scale to high-scale patches in a hierarchical manner.
We also propose adaptive computation, which allocates more network capacity and computation towards coarse image details.
The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation.
arXiv Detail & Related papers (2024-06-12T01:12:53Z)
- Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition [124.41196697408627]
We propose content-motion latent diffusion model (CMD), a novel efficient extension of pretrained image diffusion models for video generation.
CMD encodes a video as a combination of a content frame (like an image) and a low-dimensional motion latent representation.
We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model.
arXiv Detail & Related papers (2024-03-21T05:48:48Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, which bottlenecks the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- AsConvSR: Fast and Lightweight Super-Resolution Network with Assembled Convolutions [32.85522513271578]
We propose a fast and lightweight super-resolution network to achieve real-time performance.
By analyzing the applications of divide-and-conquer in super-resolution, we propose assembled convolutions which can adapt convolution kernels according to the input features.
Our method also wins first place in Track 1 of the NTIRE 2023 Real-Time Super-Resolution challenge.
arXiv Detail & Related papers (2023-05-05T09:33:34Z)
- HNeRV: A Hybrid Neural Representation for Videos [56.492309149698606]
Implicit neural representations store videos as neural networks.
We propose a Hybrid Neural Representation for Videos (HNeRV).
With content-adaptive embeddings and re-designed architecture, HNeRV outperforms implicit methods in video regression tasks.
arXiv Detail & Related papers (2023-04-05T17:55:04Z)
- Leveraging Bitstream Metadata for Fast, Accurate, Generalized Compressed Video Quality Enhancement [74.1052624663082]
We develop a deep learning architecture capable of restoring detail to compressed videos.
We condition our model on quantization data, which is readily available in the bitstream.
We show that this improves restoration accuracy compared to prior compression correction methods.
arXiv Detail & Related papers (2022-01-31T18:56:04Z)
- Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation [38.889823516049056]
The method divides a video into chunks, and streams LR video chunks together with the corresponding content-aware models to the client.
With our method, each video chunk requires less than 1% of the original parameters to be streamed, while achieving even better SR performance.
arXiv Detail & Related papers (2021-08-18T15:34:11Z)
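The last entry above streams tiny content-aware modulation parameters per chunk instead of a separate model per chunk. The sketch below illustrates that idea under the assumption of a simple channel-wise scale/shift applied to a shared backbone; the architecture, layer sizes, and names are illustrative and not that paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedBlock(nn.Module):
    """A shared conv block whose features are modulated per chunk."""
    def __init__(self, ch=32):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x, scale, shift):
        # Channel-wise scale/shift of shared features by per-chunk parameters.
        return F.relu(self.conv(x) * scale.view(1, -1, 1, 1)
                      + shift.view(1, -1, 1, 1))

class SharedSR(nn.Module):
    """Backbone shared across all chunks; only the (scale, shift) pairs differ."""
    def __init__(self, ch=32, blocks=4, up=4):
        super().__init__()
        self.head = nn.Conv2d(3, ch, 3, padding=1)
        self.blocks = nn.ModuleList(ModulatedBlock(ch) for _ in range(blocks))
        self.tail = nn.Sequential(nn.Conv2d(ch, 3 * up * up, 3, padding=1),
                                  nn.PixelShuffle(up))

    def forward(self, x, chunk_params):
        # chunk_params: one (scale, shift) pair of length-ch vectors per block.
        h = F.relu(self.head(x))
        for block, (s, b) in zip(self.blocks, chunk_params):
            h = block(h, s, b)
        return self.tail(h)

# Per-chunk payload is just the modulation vectors: for 4 blocks of 32 channels,
# that is 4 * 2 * 32 = 256 floats instead of a whole SR model.
# params = [(torch.ones(32), torch.zeros(32)) for _ in range(4)]
# sr = SharedSR()(torch.rand(1, 3, 64, 64), params)   # -> (1, 3, 256, 256)
```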