Long-horizon video prediction using a dynamic latent hierarchy
- URL: http://arxiv.org/abs/2212.14376v1
- Date: Thu, 29 Dec 2022 17:19:28 GMT
- Title: Long-horizon video prediction using a dynamic latent hierarchy
- Authors: Alexey Zakharov, Qinghai Guo, Zafeirios Fountas
- Abstract summary: We introduce Dynamic Latent Hierarchy (DLH) -- a deep hierarchical latent model that represents videos as a hierarchy of latent states.
DLH learns to disentangle representations across its hierarchy.
We demonstrate that DLH outperforms state-of-the-art benchmarks in video prediction.
- Score: 1.2891210250935146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of video prediction and generation is known to be notoriously
difficult, with the research in this area largely limited to short-term
predictions. Though plagued with noise and stochasticity, videos consist of
features that are organised in a spatiotemporal hierarchy, different features
possessing different temporal dynamics. In this paper, we introduce Dynamic
Latent Hierarchy (DLH) -- a deep hierarchical latent model that represents
videos as a hierarchy of latent states that evolve over separate and fluid
timescales. Each latent state is a mixture distribution with two components,
representing the immediate past and the predicted future, causing the model to
learn transitions only between sufficiently dissimilar states, while clustering
temporally persistent states closer together. Using this unique property, DLH
naturally discovers the spatiotemporal structure of a dataset and learns
disentangled representations across its hierarchy. We hypothesise that this
simplifies the task of modeling temporal dynamics of a video, improves the
learning of long-term dependencies, and reduces error accumulation. As
evidence, we demonstrate that DLH outperforms state-of-the-art benchmarks in
video prediction, is able to better represent stochasticity, as well as to
dynamically adjust its hierarchical and temporal structure. Our paper shows,
among other things, how progress in representation learning can translate into
progress in prediction tasks.
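As a rough illustration of the core mechanism described in the abstract -- each latent state as a two-component mixture over the immediate past and the predicted future, with transitions gated on dissimilarity -- here is a minimal sketch. Gaussian components, a KL-divergence gate, and all names are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.distributions as D

def latent_state(past_mu, past_std, fut_mu, fut_std, w=0.5):
    """One level's latent state: a two-component Gaussian mixture over
    the immediate past and the predicted future."""
    mix = D.Categorical(torch.tensor([w, 1.0 - w]))
    comps = D.Independent(
        D.Normal(torch.stack([past_mu, fut_mu]),
                 torch.stack([past_std, fut_std])),
        1)  # last dim = latent dimension
    return D.MixtureSameFamily(mix, comps)

def should_transition(past_mu, past_std, fut_mu, fut_std, thresh=1.0):
    """Gate a state transition on past/future dissimilarity; a KL
    threshold is an assumption, not necessarily the paper's criterion."""
    past = D.Independent(D.Normal(past_mu, past_std), 1)
    fut = D.Independent(D.Normal(fut_mu, fut_std), 1)
    return D.kl_divergence(past, fut) > thresh

# Nearly identical past/future components: the level keeps its state,
# clustering temporally persistent states together.
mu, std = torch.zeros(8), torch.ones(8)
state = latent_state(mu, std, mu + 0.01, std)
print(state.sample().shape)                         # torch.Size([8])
print(should_transition(mu, std, mu + 0.01, std))   # tensor(False)
```

Under this reading, a level of the hierarchy only updates when its predicted future diverges enough from its recent past, which is what lets slower levels tick on longer, data-dependent timescales.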
Related papers
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- Skeleton-Based Action Segmentation with Multi-Stage Spatial-Temporal Graph Convolutional Neural Networks [0.5156484100374059]
State-of-the-art action segmentation approaches use multiple stages of temporal convolutions.
We present multi-stage spatial-temporal graph convolutional neural networks (MS-GCN).
We replace the initial stage of temporal convolutions with spatial-temporal graph convolutions, which better exploit the spatial configuration of the joints.
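As a rough sketch of that substitution (assumed tensor shapes and a learnable adjacency; not the authors' code), a spatial-temporal graph convolution mixes features across the skeleton graph before convolving over time:

```python
import torch
import torch.nn as nn

class STGraphConv(nn.Module):
    """Spatial-temporal graph convolution over skeleton joints."""
    def __init__(self, in_ch, out_ch, num_joints, t_kernel=9):
        super().__init__()
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch,
                                  kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))
        # Learnable normalized adjacency over joints (an assumption; a
        # fixed skeleton graph could be used instead).
        self.adj = nn.Parameter(torch.eye(num_joints))

    def forward(self, x):          # x: (batch, channels, time, joints)
        x = torch.einsum('nctv,vw->nctw', x, self.adj)  # mix joints
        x = self.spatial(x)        # per-joint feature transform
        return self.temporal(x)    # aggregate over time

x = torch.randn(2, 3, 100, 25)    # 3D joint coords, 100 frames, 25 joints
print(STGraphConv(3, 64, 25)(x).shape)  # torch.Size([2, 64, 100, 25])
```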
arXiv Detail & Related papers (2022-02-03T17:42:04Z)
- Variational Predictive Routing with Nested Subjective Timescales [1.6114012813668934]
We present Variational Predictive Routing (VPR) - a neural inference system that organizes latent video features in a temporal hierarchy.
We show that VPR is able to detect event boundaries, disentangle temporal features, adapt to the dynamics hierarchy of the data, and produce accurate time-agnostic rollouts of the future.
arXiv Detail & Related papers (2021-10-21T16:12:59Z)
- ModeRNN: Harnessing Spatiotemporal Mode Collapse in Unsupervised Predictive Learning [75.2748374360642]
We propose ModeRNN, which introduces a novel method to learn hidden structured representations between recurrent states.
Across the entire dataset, different modes result in different responses on the mixtures of slots, which enhances the ability of ModeRNN to build structured representations.
arXiv Detail & Related papers (2021-10-08T03:47:54Z)
- Simple Video Generation using Neural ODEs [9.303957136142293]
We learn latent variable models that predict the future in latent space and project back to pixels.
We show that our approach yields promising results in the task of future frame prediction on the Moving MNIST dataset with 1 and 2 digits.
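A minimal sketch of this predict-in-latent-space, project-back-to-pixels idea, with a plain Euler integrator standing in for the paper's neural-ODE solver (all names and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class LatentODEPredictor(nn.Module):
    """Encode a frame, integrate learned latent dynamics forward in
    time, and decode each latent back to pixels."""
    def __init__(self, latent_dim=32, frame_dim=64 * 64):
        super().__init__()
        self.encode = nn.Linear(frame_dim, latent_dim)
        self.dynamics = nn.Sequential(            # dz/dt = f(z)
            nn.Linear(latent_dim, 128), nn.Tanh(),
            nn.Linear(128, latent_dim))
        self.decode = nn.Linear(latent_dim, frame_dim)

    def forward(self, frame, n_steps=10, dt=0.1):
        z = self.encode(frame.flatten(1))
        frames = []
        for _ in range(n_steps):                  # Euler integration
            z = z + dt * self.dynamics(z)
            frames.append(self.decode(z))
        return torch.stack(frames, dim=1)         # (batch, steps, pixels)

pred = LatentODEPredictor()(torch.randn(4, 64, 64))
print(pred.shape)  # torch.Size([4, 10, 4096])
```

An adaptive ODE solver would replace the fixed-step Euler loop in practice; the structure (encode, integrate, decode) is the point here.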
arXiv Detail & Related papers (2021-09-07T19:03:33Z)
- Interpretable Time-series Representation Learning With Multi-Level Disentanglement [56.38489708031278]
Disentangle Time Series (DTS) is a novel disentanglement enhancement framework for sequential data.
DTS generates hierarchical semantic concepts as the interpretable and disentangled representation of time-series.
DTS achieves superior performance in downstream applications, with high interpretability of semantic concepts.
arXiv Detail & Related papers (2021-05-17T22:02:24Z)
- Local Frequency Domain Transformer Networks for Video Prediction [24.126513851779936]
Video prediction is of interest not only in anticipating visual changes in the real world but has, above all, emerged as an unsupervised learning rule.
This paper proposes a fully differentiable building block that can perform all of those tasks separately while maintaining interpretability.
arXiv Detail & Related papers (2021-05-10T19:48:42Z)
- Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction [55.4498466252522]
We set a new standard of video prediction with orders of magnitude longer prediction time than existing approaches.
Our method predicts future frames by first estimating a sequence of semantic structures and subsequently translating the structures to pixels by video-to-video translation.
We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon.
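A schematic sketch of this two-stage pipeline (an assumed keypoint-style structure representation and a linear stand-in for the translation stage; not the authors' architecture):

```python
import torch
import torch.nn as nn

class StructurePredictor(nn.Module):
    """Stage 1: roll out a sequence of low-dimensional semantic
    structures (e.g., pose keypoints) with a recurrent model."""
    def __init__(self, struct_dim=34, hidden=128):
        super().__init__()
        self.rnn = nn.GRUCell(struct_dim, hidden)
        self.out = nn.Linear(hidden, struct_dim)

    def forward(self, struct, horizon):
        h = struct.new_zeros(struct.size(0), self.rnn.hidden_size)
        preds = []
        for _ in range(horizon):
            h = self.rnn(struct, h)
            struct = self.out(h)
            preds.append(struct)
        return torch.stack(preds, dim=1)  # (batch, horizon, struct_dim)

# Stage 2 stand-in: structure-to-pixel translation. A real system would
# use a conditional video-to-video generator here.
to_pixels = nn.Linear(34, 64 * 64)

structs = StructurePredictor()(torch.randn(2, 34), horizon=50)
frames = to_pixels(structs).view(2, 50, 64, 64)
print(frames.shape)  # torch.Size([2, 50, 64, 64])
```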
arXiv Detail & Related papers (2021-04-14T08:39:38Z)
- Learning Temporal Dynamics from Cycles in Narrated Video [85.89096034281694]
We propose a self-supervised solution to the problem of learning to model how the world changes as time elapses.
Our model learns modality-agnostic functions to predict forward and backward in time, which must undo each other when composed.
We apply the learned dynamics model without further training to various tasks, such as predicting future action and temporally ordering sets of images.
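A minimal sketch of the compose-and-undo constraint (hypothetical embedding models; not the authors' implementation):

```python
import torch
import torch.nn as nn

emb_dim = 256  # assumed frame-embedding size
forward_model = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU(),
                              nn.Linear(emb_dim, emb_dim))
backward_model = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU(),
                               nn.Linear(emb_dim, emb_dim))

def cycle_loss(emb_t):
    """Predict forward in time, then backward; composing the two models
    should approximately return the starting embedding."""
    there_and_back = backward_model(forward_model(emb_t))
    return nn.functional.mse_loss(there_and_back, emb_t)

print(cycle_loss(torch.randn(8, emb_dim)).item())
```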
arXiv Detail & Related papers (2021-01-07T02:41:32Z)
- Unsupervised Video Decomposition using Spatio-temporal Iterative Inference [31.97227651679233]
Multi-object scene decomposition is a fast-emerging problem in representation learning.
We show that our model has a high accuracy even without color information.
We demonstrate the decomposition, segmentation, and prediction capabilities of our model and show that it outperforms the state-of-the-art on several benchmark datasets.
arXiv Detail & Related papers (2020-06-25T22:57:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.