S-HR-VQVAE: Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction
- URL: http://arxiv.org/abs/2307.06701v3
- Date: Tue, 19 Nov 2024 13:09:06 GMT
- Title: S-HR-VQVAE: Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction
- Authors: Mohammad Adiban, Kalin Stefanov, Sabato Marco Siniscalchi, Giampiero Salvi
- Abstract summary: We put forth a novel model that combines a novel hierarchical residual learning vector quantized variational autoencoder (HR-VQVAE) and a novel autoregressive spatiotemporal predictive model (AST-PM).
We show that our model compares favorably against state-of-the-art video prediction techniques both in quantitative and qualitative evaluations despite a much smaller model size.
- Score: 16.14728977379756
- License:
- Abstract: We address the video prediction task by putting forth a novel model that combines (i) a novel hierarchical residual learning vector quantized variational autoencoder (HR-VQVAE), and (ii) a novel autoregressive spatiotemporal predictive model (AST-PM). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging the intrinsic capabilities of HR-VQVAE at modeling still images with a parsimonious representation, combined with the AST-PM's ability to handle spatiotemporal information, S-HR-VQVAE can better deal with major challenges in video prediction. These include learning spatiotemporal information, handling high dimensional data, combating blurry prediction, and implicit modeling of physical characteristics. Extensive experimental results on four challenging tasks, namely KTH Human Action, TrafficBJ, Human3.6M, and Kitti, demonstrate that our model compares favorably against state-of-the-art video prediction techniques both in quantitative and qualitative evaluations despite a much smaller model size. Finally, we boost S-HR-VQVAE by proposing a novel training method to jointly estimate the HR-VQVAE and AST-PM parameters.
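The abstract does not spell out how the hierarchical residual quantization works. As a rough illustration only, the following is a minimal sketch of generic residual vector quantization, in which each layer quantizes whatever the previous layers left unexplained. It is not the authors' HR-VQVAE; the function names, the two toy codebooks, and the input vector are all hypothetical.

```python
# Illustrative sketch only: generic residual vector quantization,
# NOT the HR-VQVAE implementation from the paper.
# All names and codebook values below are hypothetical.

def nearest(codebook, vec):
    """Return the codebook entry closest to vec (squared Euclidean distance)."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, vec)))

def residual_quantize(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the previous residual."""
    residual = list(vec)
    approx = [0.0] * len(vec)
    chosen = []
    for codebook in codebooks:
        code = nearest(codebook, residual)
        chosen.append(code)
        # Accumulate the reconstruction and update what is still unexplained.
        approx = [a + c for a, c in zip(approx, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return chosen, approx

# Two toy layers: a coarse codebook first, then a finer one for the residual.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)],
    [(0.0, 0.0), (0.25, -0.25), (-0.25, 0.25)],
]
codes, approx = residual_quantize((1.25, 0.75), codebooks)
print(codes, approx)
```

In this toy case the first layer picks the coarse code (1.0, 1.0) and the second layer captures the remaining offset, so a short list of small-codebook indices stands in for the full vector. A learned model would train the codebooks rather than fix them by hand.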
Related papers
- SalFoM: Dynamic Saliency Prediction with Video Foundation Models [37.25208752620703]
Video saliency prediction (VSP) has shown promising performance compared to the human visual system.
We introduce SalFoM, a novel encoder-decoder video transformer architecture.
Our model employs the UnMasked Teacher (UMT) as a feature extractor and presents a heterogeneous decoder with a locality-aware spatio-temporal transformer.
arXiv Detail & Related papers (2024-04-03T22:38:54Z)
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
- Koopman Invertible Autoencoder: Leveraging Forward and Backward Dynamics for Temporal Modeling [13.38194491846739]
We propose a novel machine learning model based on Koopman operator theory, which we call Koopman Invertible Autoencoders (KIA).
KIA captures the inherent characteristic of the system by modeling both forward and backward dynamics in the infinite-dimensional Hilbert space.
This enables us to efficiently learn low-dimensional representations, resulting in more accurate predictions of long-term system behavior.
arXiv Detail & Related papers (2023-09-19T03:42:55Z)
- Spatio-Temporal Encoding of Brain Dynamics with Surface Masked Autoencoders [10.097983222759884]
We introduce the surface Masked AutoEncoder (sMAE) and the video surface Masked AutoEncoder (vsMAE).
These models are trained to reconstruct cortical feature maps from masked versions of the input by learning strong latent representations of cortical development and structure function.
Results show that (v)sMAE pre-trained models improve phenotyping prediction performance on multiple tasks by $\ge 26\%$, and offer faster convergence relative to models trained from scratch.
arXiv Detail & Related papers (2023-08-10T10:01:56Z)
- Contextually Enhanced ES-dRNN with Dynamic Attention for Short-Term Load Forecasting [1.1602089225841632]
The proposed model is composed of two simultaneously trained tracks: the context track and the main track.
The RNN architecture consists of multiple recurrent layers stacked with hierarchical dilations and equipped with recently proposed attentive recurrent cells.
The model produces both point forecasts and predictive intervals.
arXiv Detail & Related papers (2022-12-18T07:42:48Z)
- IDM-Follower: A Model-Informed Deep Learning Method for Long-Sequence Car-Following Trajectory Prediction [24.94160059351764]
Most car-following models are generative and only consider the inputs of the speed, position, and acceleration of the last time step.
We implement a novel structure with two independent encoders and a self-attention decoder that could sequentially predict the following trajectories.
Numerical experiments with multiple settings on simulation and NGSIM datasets show that the IDM-Follower can improve the prediction performance.
arXiv Detail & Related papers (2022-10-20T02:24:27Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
- Closed-form Continuous-Depth Models [99.40335716948101]
Continuous-depth neural models rely on advanced numerical differential equation solvers.
We present a new family of models, termed Closed-form Continuous-depth (CfC) networks, that are simple to describe and at least one order of magnitude faster.
arXiv Detail & Related papers (2021-06-25T22:08:51Z)
- Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning [66.44344616836158]
We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text.
We show that 1) under certain non-degeneracy conditions on the HMM, simple classification heads can solve the downstream task, 2) prompt tuning obtains downstream guarantees with weaker non-degeneracy conditions, and 3) our recovery guarantees for the memory-augmented HMM are stronger than for the vanilla HMM.
arXiv Detail & Related papers (2021-06-17T03:31:47Z)
- DiscreTalk: Text-to-Speech as a Machine Translation Problem [52.33785857500754]
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
The proposed model consists of two components: a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model.
arXiv Detail & Related papers (2020-05-12T02:45:09Z)
- Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.