STIP: A SpatioTemporal Information-Preserving and Perception-Augmented
Model for High-Resolution Video Prediction
- URL: http://arxiv.org/abs/2206.04381v1
- Date: Thu, 9 Jun 2022 09:49:04 GMT
- Title: STIP: A SpatioTemporal Information-Preserving and Perception-Augmented
Model for High-Resolution Video Prediction
- Authors: Zheng Chang, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao
- Abstract summary: We propose a Spatiotemporal Information-Preserving and Perception-Augmented Model (STIP) to solve the above two problems.
The proposed model aims to preserve the spatiotemporal information for videos during the feature extraction and the state transitions.
Experimental results show that the proposed STIP can predict videos with more satisfactory visual quality compared with a variety of state-of-the-art methods.
- Score: 78.129039340528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although significant progress has been achieved by recurrent neural
network (RNN) based video prediction methods, their performance in datasets
with high resolutions is still far from satisfactory because of the information
loss problem and the perception-insensitive mean square error (MSE) based loss
functions. In this paper, we propose a Spatiotemporal Information-Preserving
and Perception-Augmented Model (STIP) to solve the above two problems. To solve
the information loss problem, the proposed model aims to preserve the
spatiotemporal information for videos during the feature extraction and the
state transitions, respectively. Firstly, a Multi-Grained Spatiotemporal
Auto-Encoder (MGST-AE) is designed based on the X-Net structure. The proposed
MGST-AE can help the decoders recall multi-grained information from the
encoders in both the temporal and spatial domains. In this way, more
spatiotemporal information can be preserved during the feature extraction for
high-resolution videos.
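As a rough illustration of the recall idea only, here is a toy 2-D PyTorch encoder-decoder whose decoder concatenates fine-grained encoder features back in. This is a generic skip-connection sketch, not the X-Net-based MGST-AE, which also recalls information in the temporal domain; all names and channel sizes below are invented for illustration.

```python
import torch
import torch.nn as nn

class MultiGrainedToyAE(nn.Module):
    """Toy 2-D encoder-decoder with multi-grained skip connections.

    Illustrative only: the paper's MGST-AE follows an X-Net structure and
    recalls information in the temporal domain as well. Assumes input
    height/width divisible by 4.
    """
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, 2, 1), nn.ReLU())     # fine grain
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, 2, 1), nn.ReLU())  # coarse grain
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 4, 2, 1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(base * 2, in_ch, 4, 2, 1)  # decoder + recalled features

    def forward(self, x):
        f1 = self.enc1(x)   # fine-grained encoder features
        f2 = self.enc2(f1)  # coarse-grained encoder features
        d2 = self.dec2(f2)
        # the decoder "recalls" fine-grained information via concatenation
        return self.dec1(torch.cat([d2, f1], dim=1))
```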
Secondly, a Spatiotemporal Gated Recurrent Unit (STGRU) is designed based on
the standard Gated Recurrent Unit (GRU) structure, which can efficiently
preserve spatiotemporal information during the state transitions. The proposed
STGRU can achieve more satisfactory performance with a much lower computation
load compared with the popular Long Short-Term Memory (LSTM) based predictive
memories.
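The exact STGRU gating is specified in the paper; as a minimal sketch of the underlying idea, a convolutional GRU-style cell keeps its hidden state as a spatial feature map so the gates act on spatial structure rather than flattened vectors. The code below is illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell (illustrative, not the paper's STGRU).

    The hidden state stays a feature map of shape (B, C, H, W), so the
    gates operate spatially at every state transition.
    """
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)  # update z, reset r
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)       # candidate state

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_new = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))  # candidate update
        return (1 - z) * h + z * h_new  # gated state transition
```

A GRU-style cell carries a single state tensor and three gate/candidate transforms per step, versus two states and four transforms in an LSTM cell, which is consistent with the lower computation load claimed above.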
Furthermore, to improve the traditional MSE loss functions, a Learned
Perceptual Loss (LP-loss) is further designed based on Generative Adversarial
Networks (GANs), which can help obtain a satisfactory trade-off between the
objective quality and the perceptual quality.
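The paper learns its perceptual metric adversarially; a minimal sketch of one common way to realize such a loss, assuming a discriminator that exposes intermediate feature maps (the function name and weights below are illustrative, not the authors' formulation):

```python
import torch
import torch.nn.functional as F

def lp_style_loss(pred, target, disc_features, w_mse=1.0, w_perc=0.05):
    """Sketch of a learned-perceptual-style loss; weights are illustrative.

    disc_features: a callable returning a list of intermediate feature maps
    from a jointly trained GAN discriminator.
    """
    mse = F.mse_loss(pred, target)  # objective-quality term
    perc = sum(F.l1_loss(fp, ft.detach())  # perceptual term on learned features
               for fp, ft in zip(disc_features(pred), disc_features(target)))
    return w_mse * mse + w_perc * perc
```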
Experimental results show that the proposed STIP can predict videos with more
satisfactory visual quality compared with a variety of state-of-the-art
methods. Source code is available at
\url{https://github.com/ZhengChang467/STIPHR}.
Related papers
- Temporal Feature Matters: A Framework for Diffusion Model Quantization [105.3033493564844]
Diffusion models rely on the time-step for the multi-round denoising.
We introduce a novel quantization framework that includes three strategies.
This framework preserves most of the temporal information and ensures high-quality end-to-end generation.
arXiv Detail & Related papers (2024-07-28T17:46:15Z) - SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
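The even spreading of tube features over clusters described above is the kind of constraint a Sinkhorn-Knopp normalization enforces; a generic sketch of that step only (not SIGMA's actual code; names and defaults are illustrative):

```python
import torch

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Generic Sinkhorn-Knopp normalization (not SIGMA's actual code).

    scores: (N, K) similarities of N space-time tube features to K clusters.
    Alternating column/row normalization pushes the soft assignments toward
    equal total mass per cluster, i.e. an even spread of features.
    """
    q = torch.exp(scores / eps)
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True)  # equalize mass per cluster (columns)
        q = q / q.sum(dim=1, keepdim=True)  # each feature's assignment sums to 1 (rows)
    return q
```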
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Transformer-based Video Saliency Prediction with High Temporal Dimension
Decoding [12.595019348741042]
We propose a transformer-based video saliency prediction approach with high temporal dimension network decoding (THTDNet).
This architecture yields comparable performance to multi-branch and over-complicated models on common benchmarks such as DHF1K, UCF-sports and Hollywood-2.
arXiv Detail & Related papers (2024-01-15T20:09:56Z) - Spatiotemporal Attention-based Semantic Compression for Real-time Video
Recognition [117.98023585449808]
We propose a spatiotemporal attention-based autoencoder (STAE) architecture to evaluate the importance of frames and of pixels in each frame.
We develop a lightweight decoder that leverages a combined 3D-2D CNN to reconstruct missing information.
Experimental results show that ViT_STAE can compress the video dataset HMDB51 by 104x with only 5% accuracy loss.
arXiv Detail & Related papers (2023-05-22T07:47:27Z) - Neighbourhood Representative Sampling for Efficient End-to-end Video
Quality Assessment [60.57703721744873]
The increased resolution of real-world videos presents a dilemma between efficiency and accuracy for deep Video Quality Assessment (VQA).
In this work, we propose a unified scheme, spatial-temporal grid mini-cube sampling (St-GMS) to get a novel type of sample, named fragments.
With fragments and FANet, the proposed efficient end-to-end FAST-VQA and FasterVQA achieve significantly better performance than existing approaches on all VQA benchmarks.
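Fragments keep one small patch per cell of a uniform grid so that global composition survives aggressive downsampling; the sketch below shows a simplified single-frame version of such grid sampling (illustrative only, not the authors' St-GMS implementation; it assumes H and W are at least grid * patch):

```python
import torch

def sample_fragment(frame, grid=7, patch=32):
    """Simplified single-frame grid patch sampling (not the authors' St-GMS).

    frame: (C, H, W) tensor with H and W at least grid * patch. One random
    patch is kept per grid cell and the patches are stitched back in grid
    order, preserving global composition with far fewer pixels.
    """
    c, h, w = frame.shape
    ch, cw = h // grid, w // grid
    rows = []
    for i in range(grid):
        cols = []
        for j in range(grid):
            y = i * ch + int(torch.randint(0, max(ch - patch, 1), (1,)))
            x = j * cw + int(torch.randint(0, max(cw - patch, 1), (1,)))
            cols.append(frame[:, y:y + patch, x:x + patch])
        rows.append(torch.cat(cols, dim=2))
    return torch.cat(rows, dim=1)  # (C, grid * patch, grid * patch)
```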
arXiv Detail & Related papers (2022-10-11T11:38:07Z) - Sliding Window Recurrent Network for Efficient Video Super-Resolution [0.0]
Video super-resolution (VSR) is the task of restoring high-resolution frames from a sequence of low-resolution inputs.
We propose a Sliding Window based Recurrent Network (SWRN), which supports real-time inference while still achieving superior performance.
Our experiment on REDS dataset shows that the proposed method can be well adapted to mobile devices and produce visually pleasant results.
arXiv Detail & Related papers (2022-08-24T15:23:44Z) - STRPM: A Spatiotemporal Residual Predictive Model for High-Resolution
Video Prediction [78.129039340528]
We propose a Spatiotemporal Residual Predictive Model (STRPM) for high-resolution video prediction.
Experimental results show that STRPM can generate more satisfactory results compared with various existing methods.
arXiv Detail & Related papers (2022-03-30T06:24:00Z) - iSeeBetter: Spatio-temporal video super-resolution using recurrent
generative back-projection networks [0.0]
We present iSeeBetter, a novel GAN-based spatio-temporal approach to video super-resolution (VSR).
iSeeBetter extracts spatial and temporal information from the current and neighboring frames using the concept of recurrent back-projection networks as its generator.
Our results demonstrate that iSeeBetter offers superior VSR fidelity and surpasses state-of-the-art performance.
arXiv Detail & Related papers (2020-06-13T01:36:30Z)