Exploring Spatial-Temporal Multi-Frequency Analysis for High-Fidelity
and Temporal-Consistency Video Prediction
- URL: http://arxiv.org/abs/2002.09905v2
- Date: Fri, 22 May 2020 14:46:22 GMT
- Authors: Beibei Jin, Yu Hu, Qiankun Tang, Jingyu Niu, Zhiping Shi, Yinhe Han,
Xiaowei Li
- Abstract summary: We propose a video prediction network based on multi-level wavelet analysis to deal with spatial and temporal information in a unified manner.
Our model shows significant improvements in fidelity and temporal consistency over state-of-the-art works.
- Score: 12.84409065286371
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video prediction is a pixel-wise dense prediction task to infer future frames
based on past frames. Missing appearance details and motion blur are still two
major problems for current predictive models, which lead to image distortion
and temporal inconsistency. In this paper, we point out the necessity of
exploring multi-frequency analysis to deal with these two problems. Inspired by
the frequency-band decomposition characteristic of the Human Visual System (HVS),
we propose a video prediction network based on multi-level wavelet analysis that
handles spatial and temporal information in a unified manner. Specifically, the
multi-level spatial discrete wavelet transform decomposes each video frame into
anisotropic sub-bands of multiple frequencies, helping to enrich structural
information and preserve fine details. Meanwhile, the multi-level temporal
discrete wavelet transform, which operates along the time axis, decomposes the
frame sequence into sub-band groups of different frequencies to accurately
capture multi-frequency motions under a fixed frame rate. Extensive experiments
on diverse datasets demonstrate that our model achieves significant improvements
in fidelity and temporal consistency over state-of-the-art works.
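The two decompositions described in the abstract can be illustrated with PyWavelets; this is a minimal sketch on a toy random video, not the authors' network, and the wavelet choice ("haar") and level count are illustrative assumptions:

```python
import numpy as np
import pywt

# Toy video: 8 frames of 64x64 grayscale (hypothetical stand-in for real frames).
video = np.random.rand(8, 64, 64).astype(np.float32)

# Spatial multi-level 2-D DWT: each frame decomposes into a low-frequency
# approximation plus anisotropic detail sub-bands (horizontal, vertical,
# diagonal) at every level.
spatial = [pywt.wavedec2(frame, "haar", level=2) for frame in video]
approx, details_lvl2, details_lvl1 = spatial[0]
print(approx.shape)           # (16, 16)  low-frequency band of the first frame
print(details_lvl1[0].shape)  # (32, 32)  finest horizontal detail band

# Temporal multi-level 1-D DWT along the time axis: the frame sequence splits
# into sub-band groups of different temporal frequencies, separating slow
# motion components from fast ones at a fixed frame rate.
temporal = pywt.wavedec(video, "haar", level=2, axis=0)
for band in temporal:
    print(band.shape)  # (2, 64, 64), (2, 64, 64), (4, 64, 64)
```

A prediction network along these lines would operate on (and recombine) these sub-bands rather than on raw pixels, so low-frequency structure and high-frequency detail are handled explicitly.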
Related papers
- Frequency Guidance Matters: Skeletal Action Recognition by Frequency-Aware Mixed Transformer [18.459822172890473]
We introduce a frequency-aware attention module to unweave skeleton frequency representations.
We also develop a mixed transformer architecture to incorporate spatial features with frequency features.
Experiments show that FreqMiXFormer outperforms SOTA on 3 popular skeleton recognition datasets.
arXiv Detail & Related papers (2024-07-17T05:47:27Z) - MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z) - Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection [19.643936110623653]
Video Anomaly Detection (VAD) aims to identify abnormalities within a specific context and timeframe.
Recent deep learning-based VAD models have shown promising results by generating high-resolution frames.
We propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task.
arXiv Detail & Related papers (2024-03-28T03:07:16Z) - Dynamic Frame Interpolation in Wavelet Domain [57.25341639095404]
Video frame interpolation is an important low-level computer vision task, which can increase the frame rate for a more fluent visual experience.
Existing methods have achieved great success by employing advanced motion models and synthesis networks.
WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.
arXiv Detail & Related papers (2023-09-07T06:41:15Z) - MultiWave: Multiresolution Deep Architectures through Wavelet
Decomposition for Multivariate Time Series Prediction [6.980076213134384]
MultiWave is a novel framework that enhances deep learning time series models by incorporating components that operate at the intrinsic frequencies of signals.
We show that MultiWave consistently identifies critical features and their frequency components, thus providing valuable insights into the applications studied.
arXiv Detail & Related papers (2023-06-16T20:07:15Z) - Motion and Context-Aware Audio-Visual Conditioned Video Prediction [58.9467115916639]
We decouple the audio-visual conditioned video prediction into motion and appearance modeling.
The multimodal motion estimation predicts future optical flow based on the audio-motion correlation.
We propose context-aware refinement to address the diminishing of the global appearance context.
arXiv Detail & Related papers (2022-12-09T05:57:46Z) - Towards Interpretable Video Super-Resolution via Alternating
Optimization [115.85296325037565]
We study a practical space-time video super-resolution (STVSR) problem which aims at generating a high-framerate high-resolution sharp video from a low-framerate blurry video.
We propose an interpretable STVSR framework by leveraging both model-based and learning-based methods.
arXiv Detail & Related papers (2022-07-21T21:34:05Z) - Spatial-Temporal Frequency Forgery Clue for Video Forgery Detection in
VIS and NIR Scenario [87.72258480670627]
Existing face forgery detection methods based on frequency domain find that the GAN forged images have obvious grid-like visual artifacts in the frequency spectrum compared to the real images.
This paper proposes a Cosine Transform-based Forgery Clue Augmentation Network (FCAN-DCT) to achieve a more comprehensive spatial-temporal feature representation.
arXiv Detail & Related papers (2022-07-05T09:27:53Z) - Convolutional Transformer based Dual Discriminator Generative
Adversarial Networks for Video Anomaly Detection [27.433162897608543]
We propose Convolutional Transformer based Dual Discriminator Generative Adversarial Networks (CT-D2GAN) to perform unsupervised video anomaly detection.
It contains three key components, i.e., a convolutional encoder to capture the spatial information of input clips, a temporal self-attention module to encode the temporal dynamics, and a decoder to predict the future frame.
arXiv Detail & Related papers (2021-07-29T03:07:25Z) - WaveFill: A Wavelet-based Generation Network for Image Inpainting [57.012173791320855]
WaveFill is a wavelet-based inpainting network that decomposes images into multiple frequency bands.
WaveFill decomposes images by using discrete wavelet transform (DWT) that preserves spatial information naturally.
It applies an L1 reconstruction loss to the low-frequency bands and an adversarial loss to the high-frequency bands, effectively mitigating inter-frequency conflicts.
arXiv Detail & Related papers (2021-07-23T04:44:40Z)
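WaveFill's frequency-split objective above can be sketched in a few lines; this is a minimal NumPy/PyWavelets illustration under stated assumptions, not the authors' implementation, and `disc_score_fn` is a hypothetical stand-in for a trained discriminator:

```python
import numpy as np
import pywt

def frequency_split_losses(pred, target, disc_score_fn):
    """Sketch of a WaveFill-style objective: L1 reconstruction on the
    low-frequency band, adversarial loss on the high-frequency bands."""
    p_low, p_high = pywt.dwt2(pred, "haar")    # cA, (cH, cV, cD)
    t_low, t_high = pywt.dwt2(target, "haar")
    # L1 reconstruction loss on the low-frequency approximation band.
    l1_low = np.abs(p_low - t_low).mean()
    # Non-saturating adversarial term on each high-frequency detail band;
    # disc_score_fn maps a band to a "real" probability in (0, 1).
    adv_high = -np.mean([np.log(disc_score_fn(b) + 1e-8) for b in p_high])
    return l1_low, adv_high

pred = np.random.rand(64, 64)
target = np.random.rand(64, 64)
# Dummy discriminator that always outputs 0.5, just to exercise the code path.
l1, adv = frequency_split_losses(pred, target, lambda b: 0.5)
```

Splitting the losses per frequency band lets the smooth global structure be supervised directly while texture-like high-frequency content is judged distributionally, which is the inter-frequency conflict the paper targets.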
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.