ModeRNN: Harnessing Spatiotemporal Mode Collapse in Unsupervised
Predictive Learning
- URL: http://arxiv.org/abs/2110.03882v1
- Date: Fri, 8 Oct 2021 03:47:54 GMT
- Title: ModeRNN: Harnessing Spatiotemporal Mode Collapse in Unsupervised
Predictive Learning
- Authors: Zhiyu Yao, Yunbo Wang, Haixu Wu, Jianmin Wang, Mingsheng Long
- Abstract summary: We propose ModeRNN, which introduces a novel method to learn structured hidden representations between recurrent states.
Across the entire dataset, different modes result in different responses across the mixture of slots, which enhances the ability of ModeRNN to build structured representations.
- Score: 75.2748374360642
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning predictive models for unlabeled spatiotemporal data is challenging
in part because visual dynamics can be highly entangled in real scenes, making
existing approaches prone to overfit partial modes of physical processes while
neglecting to reason about others. We name this phenomenon spatiotemporal mode
collapse and explore it for the first time in predictive learning. The key is
to provide the model with a strong inductive bias to discover the compositional
structures of latent modes. To this end, we propose ModeRNN, which introduces a
novel method to learn structured hidden representations between recurrent
states. The core idea of this framework is to first extract various components
of visual dynamics using a set of spatiotemporal slots with independent
parameters. Considering that multiple space-time patterns may co-exist in a
sequence, we leverage learnable importance weights to adaptively aggregate slot
features into a unified hidden representation, which is then used to update the
recurrent states. Across the entire dataset, different modes result in
different responses across the mixture of slots, which enhances the ability of
ModeRNN to build structured representations and thus prevents the so-called
mode collapse. Unlike existing models, ModeRNN is shown to prevent
spatiotemporal mode collapse and further benefit from learning mixed visual
dynamics.
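The slot-and-aggregation idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the paper uses convolutional spatiotemporal slots inside a recurrent video model, whereas the linear maps, shapes, and names below are hypothetical stand-ins chosen to show the three steps (independent per-slot extraction, learnable importance weighting, aggregation into one recurrent update).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SlotAggregationCell:
    """Toy recurrent cell with K feature slots, each with independent
    parameters, fused by adaptive importance weights (hypothetical
    linear form; ModeRNN uses convolutional spatiotemporal slots)."""

    def __init__(self, n_slots, d_in, d_hid):
        self.W_slot = rng.normal(0, 0.1, (n_slots, d_in, d_hid))  # per-slot params
        self.w_imp = rng.normal(0, 0.1, (n_slots, d_hid))         # importance scorer
        self.W_h = rng.normal(0, 0.1, (d_hid, d_hid))             # recurrent params

    def step(self, x, h):
        # 1) extract components of the dynamics with independent slot parameters
        slots = np.tanh(np.einsum('i,kij->kj', x, self.W_slot))   # (K, d_hid)
        # 2) learnable importance weights over co-existing space-time patterns
        alpha = softmax((slots * self.w_imp).sum(-1))             # (K,)
        # 3) aggregate slot features into one unified hidden representation
        fused = (alpha[:, None] * slots).sum(0)                   # (d_hid,)
        # 4) use it to update the recurrent state
        return np.tanh(fused + h @ self.W_h), alpha

cell = SlotAggregationCell(n_slots=4, d_in=8, d_hid=16)
h = np.zeros(16)
for t in range(5):
    h, alpha = cell.step(rng.normal(size=8), h)
print(alpha.shape, h.shape)  # (4,) (16,)
```

Because the importance weights are recomputed at every step, different input modes produce different slot mixtures, which is the mechanism the abstract credits with preventing mode collapse.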
Related papers
- Foundational Inference Models for Dynamical Systems [5.549794481031468]
We offer a fresh perspective on the classical problem of imputing missing time series data, whose underlying dynamics are assumed to be determined by ODEs.
We propose a novel supervised learning framework for zero-shot time series imputation, through parametric functions satisfying some (hidden) ODEs.
We empirically demonstrate that one and the same (pretrained) recognition model can perform zero-shot imputation across 63 distinct time series with missing values.
arXiv Detail & Related papers (2024-02-12T11:48:54Z)
- OpenSTL: A Comprehensive Benchmark of Spatio-Temporal Predictive Learning [67.07363529640784]
We propose OpenSTL to categorize prevalent approaches into recurrent-based and recurrent-free models.
We conduct standard evaluations on datasets across various domains, including synthetic moving object trajectory, human motion, driving scenes, traffic flow and forecasting weather.
We find that recurrent-free models achieve a better balance between efficiency and performance than recurrent models.
arXiv Detail & Related papers (2023-06-20T03:02:14Z)
- Anamnesic Neural Differential Equations with Orthogonal Polynomial Projections [6.345523830122166]
We propose PolyODE, a formulation that enforces long-range memory and preserves a global representation of the underlying dynamical system.
Our construction is backed by favourable theoretical guarantees and we demonstrate that it outperforms previous works in the reconstruction of past and future data.
arXiv Detail & Related papers (2023-03-03T10:49:09Z)
- Recur, Attend or Convolve? Frame Dependency Modeling Matters for Cross-Domain Robustness in Action Recognition [0.5448283690603357]
Previous results have shown that 2D Convolutional Neural Networks (CNNs) tend to be biased toward texture rather than shape for various computer vision tasks.
This raises the suspicion that large video models learn spurious correlations rather than tracking relevant shapes over time.
We study the cross-domain robustness of recurrent, attention-based, and convolutional video models to investigate whether robustness is influenced by how frame dependencies are modeled.
arXiv Detail & Related papers (2021-12-22T19:11:53Z)
- Closed-form Continuous-Depth Models [99.40335716948101]
Continuous-depth neural models rely on advanced numerical differential equation solvers.
We present a new family of models, termed Closed-form Continuous-depth (CfC) networks, that are simple to describe and at least one order of magnitude faster.
arXiv Detail & Related papers (2021-06-25T22:08:51Z)
- Learning Temporal Dynamics from Cycles in Narrated Video [85.89096034281694]
We propose a self-supervised solution to the problem of learning to model how the world changes as time elapses.
Our model learns modality-agnostic functions to predict forward and backward in time, which must undo each other when composed.
We apply the learned dynamics model without further training to various tasks, such as predicting future action and temporally ordering sets of images.
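The cycle constraint described above, that forward and backward predictions must undo each other when composed, can be sketched with a toy NumPy example. This is a simplification under assumed forms: the paper learns modality-agnostic neural functions from narrated video, while here the two predictors are plain linear maps trained by gradient descent on the composed-cycle residual.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy forward/backward predictors: hypothetical linear stand-ins
# for the paper's learned modality-agnostic functions.
d = 4
F = rng.normal(0, 0.3, (d, d))   # forward-in-time predictor
B = rng.normal(0, 0.3, (d, d))   # backward-in-time predictor
X = rng.normal(size=(256, d))    # toy embeddings of observed states

lr = 0.1
for _ in range(500):
    # Cycle constraint: predicting forward, then backward, must return x.
    Y = X @ F.T                       # forward prediction
    Xr = Y @ B.T                      # backward prediction of the prediction
    err = Xr - X                      # residual of the composed cycle
    # Gradients of the mean squared cycle loss w.r.t. B and F
    gB = 2 * err.T @ Y / len(X)
    gF = 2 * (err @ B).T @ X / len(X)
    B -= lr * gB
    F -= lr * gF

cycle_err = np.mean((X @ F.T @ B.T - X) ** 2)
print(cycle_err < 0.1)
```

Training only on the cycle residual drives the composition of the two maps toward the identity, which is the sense in which the two predictors "must undo each other."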
arXiv Detail & Related papers (2021-01-07T02:41:32Z)
- S2RMs: Spatially Structured Recurrent Modules [105.0377129434636]
We take a step toward models that can simultaneously exploit both modular and spatiotemporal structures.
We find our models to be robust to the number of available views and better capable of generalization to novel tasks without additional training.
arXiv Detail & Related papers (2020-07-13T17:44:30Z)
- Unsupervised Video Decomposition using Spatio-temporal Iterative Inference [31.97227651679233]
Multi-object scene decomposition is a fast-emerging problem in learning.
We show that our model achieves high accuracy even without color information.
We demonstrate the decomposition and segmentation prediction capabilities of our model and show that it outperforms the state of the art on several benchmark datasets.
arXiv Detail & Related papers (2020-06-25T22:57:17Z)
- Predicting Temporal Sets with Deep Neural Networks [50.53727580527024]
We propose an integrated solution based on deep neural networks for temporal sets prediction.
A unique perspective is to learn element relationships by constructing a set-level co-occurrence graph.
We design an attention-based module to adaptively learn the temporal dependency of elements and sets.
arXiv Detail & Related papers (2020-06-20T03:29:02Z)
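The attention-based module mentioned in the last entry can be illustrated with a minimal NumPy sketch. The shapes, the single learned query vector, and the set embeddings below are hypothetical, chosen only to show the general pattern of adaptively weighting past sets by softmax attention; the paper's actual module operates jointly on elements and sets.

```python
import numpy as np

rng = np.random.default_rng(2)

def attention_pool(set_seq, w_q):
    """Hypothetical sketch: each past set is an embedding, and scores
    against a learned query decide how much each set contributes to
    the pooled context used for prediction."""
    scores = set_seq @ w_q                 # (T,) one score per past set
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                   # softmax over time steps
    return alpha @ set_seq, alpha          # weighted sum, attention weights

T, d = 6, 8
set_seq = rng.normal(size=(T, d))  # toy embeddings of T past sets
w_q = rng.normal(size=d)           # toy learned query vector
ctx, alpha = attention_pool(set_seq, w_q)
print(ctx.shape, alpha.shape)  # (8,) (6,)
```

Because the weights are a function of the inputs rather than fixed, the module can emphasize different past sets for different users, which is what "adaptively learn the temporal dependency" refers to.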
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.