Benefits of Feature Extraction and Temporal Sequence Analysis for Video Frame Prediction: An Evaluation of Hybrid Deep Learning Models
- URL: http://arxiv.org/abs/2508.00898v1
- Date: Mon, 28 Jul 2025 10:07:00 GMT
- Title: Benefits of Feature Extraction and Temporal Sequence Analysis for Video Frame Prediction: An Evaluation of Hybrid Deep Learning Models
- Authors: Jose M. Sánchez Velázquez, Mingbo Cai, Andrew Coney, Álvaro J. García-Tejedor, Alberto Nogales
- Abstract summary: Video frame prediction has critical applications in weather forecasting and autonomous systems. This paper evaluates several hybrid deep learning approaches that combine the feature extraction capabilities of autoencoders with temporal sequence modelling.
- Score: 3.7049613588433497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, advances in Artificial Intelligence have significantly impacted computer science, particularly in the field of computer vision, enabling solutions to complex problems such as video frame prediction. Video frame prediction has critical applications in areas such as weather forecasting and autonomous systems, and can provide technical improvements such as video compression and streaming. Among Artificial Intelligence methods, Deep Learning has emerged as highly effective for solving vision-related tasks, although current frame prediction models still have room for enhancement. This paper evaluates several hybrid deep learning approaches that combine the feature extraction capabilities of autoencoders with temporal sequence modelling using Recurrent Neural Networks (RNNs), 3D Convolutional Neural Networks (3D CNNs), and related architectures. The proposed solutions were rigorously evaluated on three datasets that differ in terms of synthetic versus real-world scenarios and grayscale versus color imagery. Results demonstrate that the approaches perform well, with SSIM values increasing from 0.69 to 0.82; hybrid models combining 3D CNNs and ConvLSTMs are the most effective, and grayscale videos of real scenes are the easiest to predict.
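As a rough illustration of the hybrid design the abstract describes (a minimal sketch, not the authors' released code; all shapes and layer sizes are assumptions), a per-frame convolutional encoder can be paired with a ConvLSTM that models the latent sequence and a transposed-convolution decoder that emits the predicted next frame:

```python
# Minimal Keras sketch of an autoencoder + ConvLSTM hybrid for
# next-frame prediction. Input and layer sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, H, W, C = 10, 64, 64, 1  # assumed: 10 grayscale 64x64 frames

inputs = layers.Input(shape=(SEQ_LEN, H, W, C))
# Encoder: the same convolutional feature extractor applied to each frame.
x = layers.TimeDistributed(
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"))(inputs)
x = layers.TimeDistributed(
    layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"))(x)
# Temporal model: a ConvLSTM summarizes the sequence of latent feature maps.
x = layers.ConvLSTM2D(64, 3, padding="same", return_sequences=False)(x)
# Decoder: upsample the final latent state back to a full-resolution frame.
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same",
                           activation="relu")(x)
next_frame = layers.Conv2DTranspose(C, 3, strides=2, padding="same",
                                    activation="sigmoid")(x)

model = models.Model(inputs, next_frame)
model.compile(optimizer="adam", loss="mse")
```

Predicted frames can then be scored against ground truth with tf.image.ssim(y_true, y_pred, max_val=1.0), the metric the abstract reports.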
Related papers
- Lightweight Stochastic Video Prediction via Hybrid Warping [10.448675566568086]
Accurate video prediction by deep neural networks, especially for dynamic regions, is a challenging task in computer vision for critical applications such as autonomous driving, remote working, and telemedicine. We propose a novel long-term video prediction model that focuses on dynamic regions by employing a hybrid warping strategy. Considering real-time prediction, we introduce a MobileNet-based lightweight architecture into our model.
arXiv Detail & Related papers (2024-12-04T06:33:27Z)
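For intuition about the warping operation this entry names (a generic backward-warping sketch, not the paper's implementation; in practice the flow field would be predicted by the network rather than fixed):

```python
# Hypothetical backward-warping sketch: each output pixel samples the
# previous frame at a position displaced by a flow field.
import numpy as np
from scipy.ndimage import map_coordinates

def warp(frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp an (H, W) frame by a (2, H, W) pixel-space flow."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    # Source coordinates (row, col) each output pixel is pulled from.
    coords = np.stack([ys + flow[0], xs + flow[1]])
    return map_coordinates(frame, coords, order=1, mode="nearest")

prev_frame = np.random.rand(64, 64).astype(np.float32)
flow = np.ones((2, 64, 64), dtype=np.float32)  # shift content by one pixel
predicted = warp(prev_frame, flow)
print(predicted.shape)  # (64, 64)
```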
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- Computer Vision Model Compression Techniques for Embedded Systems: A Survey [75.38606213726906]
This paper covers the main model compression techniques applied for computer vision tasks.
We present the characteristics of compression subareas, compare different approaches, and discuss how to choose the best technique.
We also share codes to assist researchers and new practitioners in overcoming initial implementation challenges.
arXiv Detail & Related papers (2024-08-15T16:41:55Z)
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
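As background for the tri-plane representation this entry mentions (a generic sketch in the style of 3D-aware generative models, not RAVEN's architecture; plane resolution and channel count are assumptions), a 3D point is projected onto three orthogonal feature planes, each plane is sampled bilinearly, and the features are summed before a small implicit decoder (omitted here) maps them to outputs:

```python
# Hypothetical tri-plane feature lookup: the explicit half of an
# explicit-implicit representation.
import numpy as np

def bilinear(plane: np.ndarray, u: float, v: float) -> np.ndarray:
    """Sample a (C, R, R) feature plane at continuous coords u, v."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1 = min(u0 + 1, plane.shape[1] - 1)
    v1 = min(v0 + 1, plane.shape[2] - 1)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * plane[:, u0, v0]
            + du * (1 - dv) * plane[:, u1, v0]
            + (1 - du) * dv * plane[:, u0, v1]
            + du * dv * plane[:, u1, v1])

R, C = 64, 32                          # plane resolution, feature channels
planes = np.random.randn(3, C, R, R)   # XY, XZ, YZ feature planes

def triplane_features(p: np.ndarray) -> np.ndarray:
    """Aggregate features for a point p in the unit cube [0, 1]^3."""
    x, y, z = (R - 1) * p
    return (bilinear(planes[0], x, y)     # XY plane
            + bilinear(planes[1], x, z)   # XZ plane
            + bilinear(planes[2], y, z))  # YZ plane

print(triplane_features(np.array([0.5, 0.5, 0.5])).shape)  # (C,)
```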
- Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective [64.04617968947697]
We introduce a novel data-model co-design perspective to promote superior weight sparsity.
Specifically, customized Visual Prompts are mounted to upgrade neural network sparsification in our proposed VPNs framework.
arXiv Detail & Related papers (2023-12-03T13:50:24Z)
- Conditional Generative Modeling for Images, 3D Animations, and Video [4.422441608136163]
This dissertation attempts to drive innovation in the field of generative modeling for computer vision.
The research focuses on architectures that transform noise and visual data, and on the application of encoder-decoder architectures to generative tasks and 3D content manipulation.
arXiv Detail & Related papers (2023-10-19T21:10:39Z)
- Neural Textured Deformable Meshes for Robust Analysis-by-Synthesis [17.920305227880245]
Our paper formulates three vision tasks in a consistent manner using approximate analysis-by-synthesis.
We show that our analysis-by-synthesis is much more robust than conventional neural networks when evaluated on real-world images.
arXiv Detail & Related papers (2023-05-31T18:45:02Z)
- Convolution, aggregation and attention based deep neural networks for accelerating simulations in mechanics [1.0154623955833253]
We demonstrate three types of neural network architectures for efficient learning of deformations of solid bodies.
The first two are based on the recently proposed CNN U-NET and MAgNET frameworks which have shown promising performance for learning on mesh-based data.
The third architecture is Perceiver IO, a very recent architecture that belongs to the family of attention-based neural networks.
arXiv Detail & Related papers (2022-12-01T13:10:56Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
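CONVIQT learns its representations contrastively; as a generic stand-in for such an objective (the paper's exact loss may differ), the sketch below computes an NT-Xent loss, which pulls two views of the same clip together and pushes different clips apart:

```python
# Hypothetical NT-Xent (normalized temperature-scaled cross-entropy) loss,
# a standard self-supervised contrastive objective.
import numpy as np

def nt_xent(z1: np.ndarray, z2: np.ndarray, tau: float = 0.1) -> float:
    z = np.concatenate([z1, z2])                       # (2N, d) embeddings
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize
    sim = z @ z.T / tau                                # scaled cosine sims
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), positives].mean())

z1 = np.random.randn(8, 128)  # embeddings of 8 clips, augmented view 1
z2 = np.random.randn(8, 128)  # embeddings of the same clips, view 2
print(nt_xent(z1, z2))
```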
- Interactive Analysis of CNN Robustness [11.136837582678869]
Perturber is a web-based application that allows users to explore how CNN activations and predictions evolve when a 3D input scene is interactively perturbed.
Perturber offers a large variety of scene modifications, such as camera controls, lighting and shading effects, background modifications, object morphing, as well as adversarial attacks.
Case studies with machine learning experts have shown that Perturber helps users to quickly generate hypotheses about model vulnerabilities and to qualitatively compare model behavior.
arXiv Detail & Related papers (2021-10-14T18:52:39Z)
- Scene Synthesis via Uncertainty-Driven Attribute Synchronization [52.31834816911887]
This paper introduces a novel neural scene synthesis approach that can capture diverse feature patterns of 3D scenes.
Our method combines the strength of both neural network-based and conventional scene synthesis approaches.
arXiv Detail & Related papers (2021-08-30T19:45:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.