Fast Fourier Inception Networks for Occluded Video Prediction
- URL: http://arxiv.org/abs/2306.10346v1
- Date: Sat, 17 Jun 2023 13:27:29 GMT
- Title: Fast Fourier Inception Networks for Occluded Video Prediction
- Authors: Ping Li and Chenhan Zhang and Xianghua Xu
- Abstract summary: Video prediction is a pixel-level task that generates future frames by employing the historical frames.
We develop the fully convolutional Fast Fourier Inception Networks for video prediction, termed FFINet, which includes two primary components, i.e., the occlusion inpainter and the spatiotemporal translator.
- Score: 16.99757795577547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video prediction is a pixel-level task that generates future frames by
employing the historical frames. Videos often contain continuous complex
motions, such as object overlapping and scene occlusion, which pose great
challenges to this task. Previous works either fail to capture long-term
temporal dynamics well or do not handle occlusion masks. To address
these issues, we develop the fully convolutional Fast Fourier Inception
Networks for video prediction, termed FFINet, which includes two
primary components, i.e., the occlusion inpainter and the spatiotemporal
translator. The former adopts the fast Fourier convolutions to enlarge the
receptive field, such that the missing areas (occlusion) with complex geometric
structures are filled by the inpainter. The latter employs the stacked Fourier
transform inception module to learn the temporal evolution by group
convolutions and the spatial movement by channel-wise Fourier convolutions,
which captures both the local and the global spatiotemporal features. This
encourages generating more realistic and high-quality future frames. To
optimize the model, a recovery loss is imposed on the objective, i.e.,
minimizing the mean square error between the ground-truth frame and the
recovery frame. Both quantitative and qualitative experimental results on five
benchmarks, including Moving MNIST, TaxiBJ, Human3.6M, Caltech Pedestrian, and
KTH, have demonstrated the superiority of the proposed approach. Our code is
available at GitHub.
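The two mechanisms the abstract highlights can be illustrated compactly: applying learnable weights to a feature map's Fourier spectrum makes every output pixel depend on every input pixel (a frame-wide receptive field), and the recovery loss is a plain mean squared error. Below is a minimal NumPy sketch of these ideas, not the authors' implementation; the per-channel spectral weights stand in for the 1x1 spectral convolution used in fast Fourier convolutions.

```python
import numpy as np

def fourier_conv2d(x, w_real, w_imag):
    """Channel-wise 'convolution' in the Fourier domain.

    Applies a real 2D FFT over the spatial dimensions, mixes the
    spectrum with learnable complex weights, then inverts. Because
    each spectral coefficient depends on every pixel, the effective
    receptive field spans the whole frame.

    x:              (C, H, W) feature map
    w_real, w_imag: (C,) per-channel spectral weights (toy stand-in
                    for a learned 1x1 spectral convolution)
    """
    X = np.fft.rfft2(x, axes=(-2, -1))              # (C, H, W//2+1), complex
    X = X * (w_real + 1j * w_imag)[:, None, None]   # pointwise spectral mixing
    return np.fft.irfft2(X, s=x.shape[-2:], axes=(-2, -1))

def recovery_loss(pred, target):
    """Mean squared error between the recovered and ground-truth frames."""
    return np.mean((pred - target) ** 2)
```

With identity weights (real part 1, imaginary part 0), the layer reduces to the identity map, which is a quick sanity check that the forward and inverse transforms are wired correctly.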
Related papers
- StableDPT: Temporal Stable Monocular Video Depth Estimation [14.453483279783908]
We propose a novel approach that adapts any state-of-the-art image-based depth estimation model for video processing. Our architecture builds upon an off-the-shelf Vision Transformer (ViT) encoder and enhances the Dense Prediction Transformer (DPT) head. Evaluations on multiple benchmark datasets demonstrate improved temporal consistency, competitive state-of-the-art performance, and 2x faster processing in real-world scenarios.
arXiv Detail & Related papers (2026-01-06T08:02:14Z) - Continuous Space-Time Video Super-Resolution with 3D Fourier Fields [62.270473766381976]
We introduce a novel formulation for continuous space-time video super-resolution. We show that our joint modeling substantially improves both spatial and temporal super-resolution.
arXiv Detail & Related papers (2025-09-30T14:34:02Z) - Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos [27.595035122927204]
We present a feed-forward 4D human reconstruction and interpolation model that efficiently calibrates temporally aligned representations from uncalibrated sparse-view videos. For novel time steps, we design a motion prediction module to predict dense motions for each 3D Gaussian between two adjacent frames. Experiments demonstrate the effectiveness of our model on both in-domain and out-of-domain datasets.
arXiv Detail & Related papers (2025-09-29T02:47:14Z) - ResidualViT for Efficient Temporally Dense Video Encoding [66.57779133786131]
We make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model.
arXiv Detail & Related papers (2025-09-16T17:12:23Z) - Mavors: Multi-granularity Video Representation for Multimodal Large Language Model [39.24524388617938]
Mavors is a novel framework for holistic long-video modeling.
Mavors encodes raw video content into latent representations through two core components.
The framework unifies image and video understanding by treating images as single-frame videos.
arXiv Detail & Related papers (2025-04-14T10:14:44Z) - FOF-X: Towards Real-time Detailed Human Reconstruction from a Single Image [68.84221452621674]
We introduce FOF-X for real-time reconstruction of detailed human geometry from a single image.
FOF-X avoids the performance degradation caused by texture and lighting.
We enhance the inter-conversion algorithms between FOF and mesh representations with a Laplacian constraint and an automaton-based discontinuity matcher.
arXiv Detail & Related papers (2024-12-08T14:46:29Z) - SIGMA:Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Temporal Residual Jacobians For Rig-free Motion Transfer [45.640576754352104]
We introduce Residual Temporal Jacobians as a novel representation to enable data-driven motion transfer.
Our approach does not assume access to any rigging or intermediate shapes, produces geometrically and temporally consistent motions, and can be used to transfer long motion sequences.
arXiv Detail & Related papers (2024-07-20T18:29:22Z) - ReBotNet: Fast Real-time Video Enhancement [59.08038313427057]
Most restoration networks are slow, have a high computational bottleneck, and cannot be used for real-time video enhancement.
In this work, we design an efficient and fast framework to perform real-time enhancement for practical use-cases like live video calls and video streams.
To evaluate our method, we introduce two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computations, reduced memory requirements, and faster inference time.
arXiv Detail & Related papers (2023-03-23T17:58:05Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous respectable works have made decent success, but they only focus on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - TTVFI: Learning Trajectory-Aware Transformer for Video Frame
Interpolation [50.49396123016185]
Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames.
We propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI).
Our method outperforms other state-of-the-art methods in four widely-used VFI benchmarks.
arXiv Detail & Related papers (2022-07-19T03:37:49Z) - A Simple Baseline for Video Restoration with Grouped Spatial-temporal
Shift [36.71578909392314]
In this study, we propose a simple yet effective framework for video restoration.
Our approach is based on grouped spatial-temporal shift, which is a lightweight and straightforward technique.
Our framework outperforms the previous state-of-the-art method, while using less than a quarter of its computational cost.
arXiv Detail & Related papers (2022-06-22T02:16:47Z) - Fourier PlenOctrees for Dynamic Radiance Field Rendering in Real-time [43.0484840009621]
Implicit neural representations such as Neural Radiance Field (NeRF) have focused mainly on modeling static objects captured under multi-view settings.
We present a novel Fourier PlenOctree (FPO) technique to tackle efficient neural modeling and real-time rendering of dynamic scenes captured under the free-view video (FVV) setting.
We show that the proposed method is 3000 times faster than the original NeRF and over an order of magnitude acceleration over SOTA.
arXiv Detail & Related papers (2022-02-17T11:57:01Z) - Convolutional Transformer based Dual Discriminator Generative
Adversarial Networks for Video Anomaly Detection [27.433162897608543]
We propose Convolutional Transformer based Dual Discriminator Generative Adversarial Networks (CT-D2GAN) to perform unsupervised video anomaly detection.
It contains key components including a convolutional encoder to capture the spatial information of input clips and a temporal self-attention module to encode the temporal dynamics and predict the future frame.
arXiv Detail & Related papers (2021-07-29T03:07:25Z) - Revisiting Hierarchical Approach for Persistent Long-Term Video
Prediction [55.4498466252522]
We set a new standard of video prediction with orders of magnitude longer prediction time than existing approaches.
Our method predicts future frames by first estimating a sequence of semantic structures and subsequently translating the structures to pixels by video-to-video translation.
We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon.
arXiv Detail & Related papers (2021-04-14T08:39:38Z) - Enhanced Quadratic Video Interpolation [56.54662568085176]
We propose an enhanced quadratic video interpolation (EQVI) model to handle more complicated scenes and motion patterns.
To further boost the performance, we devise a novel multi-scale fusion network (MS-Fusion) which can be regarded as a learnable augmentation process.
The proposed EQVI model won the first place in the AIM 2020 Video Temporal Super-Resolution Challenge.
arXiv Detail & Related papers (2020-09-10T02:31:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.