Real-time Video Prediction With Fast Video Interpolation Model and Prediction Training
- URL: http://arxiv.org/abs/2503.23185v2
- Date: Fri, 04 Apr 2025 14:29:20 GMT
- Title: Real-time Video Prediction With Fast Video Interpolation Model and Prediction Training
- Authors: Shota Hirose, Kazuki Kotoyori, Kasidis Arunruangsirilert, Fangzheng Lin, Heming Sun, Jiro Katto
- Abstract summary: We propose real-time video prediction toward zero-latency interaction over networks, called IFRVP. We introduce ELAN-based residual blocks into the prediction models to improve both inference speed and accuracy. Our evaluations show that our proposed models perform efficiently and achieve the best trade-off between prediction accuracy and computational speed.
- Score: 9.225628670664596
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transmission latency significantly affects users' quality of experience in real-time interaction and actuation. Since latency is fundamentally unavoidable, video prediction can be used to mask it and ultimately enable zero-latency transmission. However, most existing video prediction methods are computationally expensive and impractical for real-time applications. In this work, we therefore propose IFRVP (Intermediate Feature Refinement Video Prediction), a real-time video prediction method aimed at zero-latency interaction over networks. First, we propose three training methods for video prediction that extend frame interpolation models, using a simple convolution-only frame interpolation network based on IFRNet. Second, we introduce ELAN-based residual blocks into the prediction models to improve both inference speed and accuracy. Our evaluations show that the proposed models run efficiently and achieve the best trade-off between prediction accuracy and computational speed among existing video prediction methods. A demonstration movie is provided at http://bit.ly/IFRVPDemo, and the code will be released at https://github.com/FykAikawa/IFRVP.
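The abstract names two concrete ingredients: prediction training that repurposes a frame interpolation network, and ELAN-based residual blocks. A minimal PyTorch sketch of both follows, assuming an ELAN-style block that concatenates the intermediate features of stacked convolutions and fuses them with a 1x1 convolution, and a convolution-only predictor that maps two past frames to the next frame. All class names, widths, and depths are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): an ELAN-style residual block and a
# convolution-only extrapolator that maps two past frames to the next frame.
import torch
import torch.nn as nn

class ELANResidualBlock(nn.Module):
    """ELAN-style block: stacked 3x3 convs whose intermediate outputs are
    concatenated and fused by a 1x1 conv, wrapped in a residual connection."""
    def __init__(self, channels: int, depth: int = 3):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.LeakyReLU(0.1, inplace=True))
            for _ in range(depth)
        )
        self.fuse = nn.Conv2d(channels * (depth + 1), channels, 1)

    def forward(self, x):
        feats = [x]
        for stage in self.stages:
            feats.append(stage(feats[-1]))          # collect features at every depth
        return x + self.fuse(torch.cat(feats, dim=1))

class FramePredictor(nn.Module):
    """Maps two past RGB frames to a prediction of the next frame."""
    def __init__(self, width: int = 64, num_blocks: int = 4):
        super().__init__()
        self.encode = nn.Conv2d(6, width, 3, padding=1)   # two stacked RGB frames
        self.blocks = nn.Sequential(*[ELANResidualBlock(width) for _ in range(num_blocks)])
        self.decode = nn.Conv2d(width, 3, 3, padding=1)

    def forward(self, frame_t0, frame_t1):
        x = self.encode(torch.cat([frame_t0, frame_t1], dim=1))
        return self.decode(self.blocks(x))

if __name__ == "__main__":
    model = FramePredictor()
    f0, f1 = torch.rand(1, 3, 256, 448), torch.rand(1, 3, 256, 448)
    print(model(f0, f1).shape)  # torch.Size([1, 3, 256, 448])
```

At inference time such a predictor can be applied recursively, feeding its own output back in as the newest frame to predict further ahead.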
Related papers
- Video Prediction Transformers without Recurrence or Convolution [65.93130697098658]
We propose PredFormer, a framework entirely based on Gated Transformers. We provide a comprehensive analysis of 3D Attention in the context of video prediction. The significant improvements in both accuracy and efficiency highlight the potential of PredFormer.
arXiv Detail & Related papers (2024-10-07T03:52:06Z)
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion to this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- TKN: Transformer-based Keypoint Prediction Network For Real-time Video Prediction [16.294105130947]
We propose a transformer-based keypoint prediction neural network (TKN) for video prediction.
TKN is an unsupervised learning method that boosts the prediction process via constrained information extraction and a parallel prediction scheme.
Extensive experiments on the KTH and Human3.6M datasets demonstrate that TKN predicts 11 times faster than existing methods.
arXiv Detail & Related papers (2023-03-17T07:26:16Z)
- StreamYOLO: Real-time Object Detection for Streaming Perception [84.2559631820007]
We endow the models with the capacity to predict the future, significantly improving the results for streaming perception.
We consider driving scenes with multiple velocities and propose Velocity-awared streaming AP (VsAP) to jointly evaluate accuracy across them.
Our simple method achieves state-of-the-art performance on the Argoverse-HD dataset and improves sAP and VsAP by 4.7% and 8.2%, respectively.
arXiv Detail & Related papers (2022-07-21T12:03:02Z)
- Optimizing Video Prediction via Video Frame Interpolation [53.16726447796844]
We present a new optimization framework for video prediction via video frame interpolation, inspired by the photo-realistic results of video frame interpolation.
Our framework is based on optimization with a pretrained differentiable video frame interpolation module and requires no training dataset.
Our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information (a minimal sketch of the optimization idea appears after this list).
arXiv Detail & Related papers (2022-06-27T17:03:46Z)
- VPTR: Efficient Transformers for Video Prediction [14.685237010856953]
We propose a new Transformer block for video future frame prediction based on an efficient local spatial-temporal separation attention mechanism.
Based on this new Transformer block, a fully autoregressive video future frame prediction Transformer is proposed.
A non-autoregressive video prediction Transformer is also proposed to increase inference speed and reduce the accumulated inference errors of its autoregressive counterpart (a sketch of separated spatio-temporal attention appears after this list).
arXiv Detail & Related papers (2022-03-29T18:09:09Z)
- Long-term Video Frame Interpolation via Feature Propagation [95.18170372022703]
Video frame interpolation (VFI) works generally predict intermediate frame(s) by first estimating the motion between the inputs and then warping the inputs to the target time with the estimated motion.
This approach is not optimal when the temporal distance between the input frames increases.
We propose a propagation network (PNet) by extending the classic feature-level forecasting with a novel motion-to-feature approach.
arXiv Detail & Related papers (2022-03-29T10:47:06Z)
- Real-time Object Detection for Streaming Perception [84.2559631820007]
Streaming perception is proposed to jointly evaluate latency and accuracy with a single metric for online video perception.
We build a simple and effective framework for streaming perception.
Our method achieves competitive performance on Argoverse-HD dataset and improves the AP by 4.9% compared to the strong baseline.
arXiv Detail & Related papers (2022-03-23T11:33:27Z)
- Video Prediction at Multiple Scales with Hierarchical Recurrent Networks [24.536256844130996]
We propose MSPred, a novel video prediction model able to simultaneously forecast future outcomes at different levels of granularity.
By combining spatial and temporal downsampling, MSPred efficiently predicts abstract representations over long time horizons.
In our experiments, we demonstrate that our proposed model accurately predicts future video frames as well as other representations in various scenarios.
arXiv Detail & Related papers (2022-03-17T13:08:28Z)
- Fourier-based Video Prediction through Relational Object Motion [28.502280038100167]
Deep recurrent architectures have been applied to the task of video prediction.
Here, we instead explore frequency-domain approaches to video prediction (a phase-extrapolation sketch appears after this list).
The resulting predictions are consistent with the observed dynamics in a scene and do not suffer from blur.
arXiv Detail & Related papers (2021-10-12T10:43:05Z)
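As referenced above for "Optimizing Video Prediction via Video Frame Interpolation", here is a minimal sketch of the optimization idea: the unknown future frame is treated as a variable and optimized so that a pretrained, differentiable interpolation model, given the last observed frame and the candidate future frame, reconstructs the current frame. The `interp_model` callable, the Adam settings, and the L1 objective are assumptions for illustration; the usage example substitutes a trivial averaging function where the paper would use a frozen VFI network.

```python
# Minimal sketch: optimize the pixels of the future frame x2 so that the
# interpolation midpoint(x0, x2) reproduces the observed frame x1.
import torch
import torch.nn.functional as F

def predict_next_frame(interp_model, x0, x1, steps=200, lr=0.05):
    """Gradient descent on the future frame x2; no training dataset needed."""
    x2 = x1.clone().requires_grad_(True)      # initialize the future frame from x1
    opt = torch.optim.Adam([x2], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        midpoint = interp_model(x0, x2)       # interpolate at t = 0.5
        loss = F.l1_loss(midpoint, x1)        # the midpoint should equal x1
        loss.backward()
        opt.step()
    return x2.detach().clamp(0.0, 1.0)

if __name__ == "__main__":
    # Trivial averaging stand-in for a pretrained differentiable VFI model.
    average = lambda a, b: 0.5 * (a + b)
    x0, x1 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
    x2 = predict_next_frame(average, x0, x1)
    print(x2.shape)  # torch.Size([1, 3, 64, 64])
```

With the averaging stand-in, the optimum is the linear extrapolation 2*x1 - x0; a real interpolation network instead imposes its learned motion model on the predicted frame.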
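For the VPTR entry, a generic sketch of spatial-temporal separated attention, assumed from the summary rather than taken from VPTR's exact block design: attention runs across the spatial tokens of each frame, then across time steps at each spatial location, which is far cheaper than joint attention over all frames at once.

```python
# Minimal sketch (an assumption, not VPTR's exact block): full attention over a
# (T, H*W) token grid is factored into spatial attention within each frame and
# temporal attention at each spatial location, reducing cost from
# O((T*H*W)^2) to O(T*(H*W)^2 + H*W*T^2).
import torch
import torch.nn as nn

class SeparatedSpatioTemporalAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, tokens, dim), where tokens = H * W patch embeddings
        b, t, n, d = x.shape
        xs = x.reshape(b * t, n, d)                       # attend within each frame
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        xt = xs.reshape(b, t, n, d).transpose(1, 2).reshape(b * n, t, d)
        xt = xt + self.temporal(xt, xt, xt, need_weights=False)[0]  # attend across time
        return xt.reshape(b, n, t, d).transpose(1, 2)

if __name__ == "__main__":
    block = SeparatedSpatioTemporalAttention(dim=64)
    x = torch.rand(2, 8, 49, 64)                          # 8 frames of 7x7 patch tokens
    print(block(x).shape)                                 # torch.Size([2, 8, 49, 64])
```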
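Finally, for the Fourier-based entry, a toy NumPy sketch of the general frequency-domain principle (not the paper's relational-object method): a translating pattern changes only the phases of its Fourier spectrum, so the per-frequency phase difference between two frames can be applied once more to extrapolate the next frame while leaving spectral magnitudes, and hence sharpness, untouched.

```python
# Toy sketch of frequency-domain motion extrapolation for grayscale frames:
# translation only shifts Fourier phases, so applying the observed phase step
# again predicts the next frame without blurring the spectrum's magnitudes.
import numpy as np

def fourier_extrapolate(frame0: np.ndarray, frame1: np.ndarray) -> np.ndarray:
    """Predict frame2 from two grayscale frames via phase extrapolation."""
    F0, F1 = np.fft.fft2(frame0), np.fft.fft2(frame1)
    phase_step = np.angle(F1) - np.angle(F0)       # per-frequency phase motion
    F2 = np.abs(F1) * np.exp(1j * (np.angle(F1) + phase_step))
    return np.real(np.fft.ifft2(F2))

if __name__ == "__main__":
    base = np.random.rand(64, 64)
    f0 = np.roll(base, 0, axis=1)                  # pattern translating rightward
    f1 = np.roll(base, 2, axis=1)
    f2 = fourier_extrapolate(f0, f1)
    print(np.abs(f2 - np.roll(base, 4, axis=1)).mean())  # close to 0
```

This only models global translation exactly; handling multiple independently moving objects, as the paper's title suggests, requires decomposing the scene first.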