Learning Disentangled Representations of Video with Missing Data
- URL: http://arxiv.org/abs/2006.13391v2
- Date: Tue, 3 Nov 2020 20:56:04 GMT
- Title: Learning Disentangled Representations of Video with Missing Data
- Authors: Armand Comas-Massagué, Chi Zhang, Zlatan Feric, Octavia Camps, Rose Yu
- Abstract summary: We present Disentangled Imputed Video autoEncoder (DIVE), a deep generative model that imputes and predicts future video frames in the presence of missing data.
Specifically, DIVE introduces a missingness latent variable and disentangles the hidden video representations into static and dynamic appearance, pose, and missingness factors for each object.
On a Moving MNIST dataset with various missing-data scenarios, DIVE outperforms state-of-the-art baselines by a substantial margin.
- Score: 17.34839550557689
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Missing data poses significant challenges when learning representations of
video sequences. We present Disentangled Imputed Video autoEncoder (DIVE), a
deep generative model that imputes and predicts future video frames in the
presence of missing data. Specifically, DIVE introduces a missingness latent
variable and disentangles the hidden video representations into static and
dynamic appearance, pose, and missingness factors for each object. DIVE imputes
each object's trajectory where data is missing. On a Moving MNIST dataset with
various missing-data scenarios, DIVE outperforms state-of-the-art baselines by a
substantial margin. We also present comparisons on the real-world MOTSChallenge
pedestrian dataset, which demonstrate the practical value of our method in a
more realistic setting. Our code and data can be found at
https://github.com/Rose-STL-Lab/DIVE.
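For intuition, below is a minimal, hypothetical PyTorch sketch of how a DIVE-style encoder might split per-object, per-frame features into static appearance, dynamic appearance, pose, and missingness factors, and impute the pose where frames are predicted to be missing. All module names, dimensions, and the blending rule are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch of a DIVE-style disentangled latent split (not the
# authors' code; names and dimensions are illustrative assumptions).
import torch
import torch.nn as nn


class ToyDisentangledEncoder(nn.Module):
    """Encodes per-object, per-frame features into separate latent factors:
    static appearance (one per object), dynamic appearance, pose, and a
    per-frame missingness probability used to decide where pose is imputed."""

    def __init__(self, feat_dim=128, z_app=32, z_dyn=16, z_pose=4):
        super().__init__()
        self.static_app = nn.Linear(feat_dim, z_app)    # time-invariant appearance
        self.dynamic_app = nn.GRU(feat_dim, z_dyn, batch_first=True)
        self.pose = nn.Linear(feat_dim, z_pose)         # e.g. (x, y, scale, angle)
        self.missing = nn.Linear(feat_dim, 1)           # missingness logit per frame
        self.pose_prior = nn.GRU(z_pose, z_pose, batch_first=True)  # trajectory model

    def forward(self, feats):
        # feats: (batch, time, feat_dim) features for a single tracked object
        z_static = self.static_app(feats.mean(dim=1))   # pooled over time
        z_dynamic, _ = self.dynamic_app(feats)
        z_pose = self.pose(feats)
        p_missing = torch.sigmoid(self.missing(feats))  # (batch, time, 1)

        # Impute pose where a frame is predicted missing: blend the observed
        # pose with a prediction from the learned trajectory prior.
        z_pose_pred, _ = self.pose_prior(z_pose)
        z_pose_imputed = p_missing * z_pose_pred + (1 - p_missing) * z_pose
        return z_static, z_dynamic, z_pose_imputed, p_missing


if __name__ == "__main__":
    enc = ToyDisentangledEncoder()
    feats = torch.randn(2, 10, 128)                     # 2 objects, 10 frames
    z_s, z_d, z_p, p_m = enc(feats)
    print(z_s.shape, z_d.shape, z_p.shape, p_m.shape)
```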
Related papers
- Towards Student Actions in Classroom Scenes: New Dataset and Baseline [43.268586725768465]
We present a new multi-label student action video (SAV) dataset for complex classroom scenes.
The dataset consists of 4,324 carefully trimmed video clips from 758 different classrooms, each labeled with 15 different actions displayed by students in classrooms.
arXiv Detail & Related papers (2024-09-02T03:44:24Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments consistently demonstrates our method's superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs [64.60035916955837]
VANE-Bench is a benchmark designed to assess the proficiency of Video-LMMs in detecting anomalies and inconsistencies in videos.
Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models.
We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmark and find that most models struggle to identify the subtle anomalies.
arXiv Detail & Related papers (2024-06-14T17:59:01Z)
- Kick Back & Relax++: Scaling Beyond Ground-Truth Depth with SlowTV & CribsTV [50.616892315086574]
This paper proposes two novel datasets: SlowTV and CribsTV.
These are large-scale datasets curated from publicly available YouTube videos, containing a total of 2M training frames.
We leverage these datasets to tackle the challenging task of zero-shot generalization.
arXiv Detail & Related papers (2024-03-03T17:29:03Z)
- Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models [31.69213233651326]
We introduce the novel task of Visual Data-Type Identification.
An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
arXiv Detail & Related papers (2023-10-12T17:59:30Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Learning Dynamic View Synthesis With Few RGBD Cameras [60.36357774688289]
We propose to utilize RGBD cameras to synthesize free-viewpoint videos of dynamic indoor scenes.
We generate point clouds from RGBD frames and then render them into free-viewpoint videos via a neural feature.
We introduce a simple Regional Depth-Inpainting module that adaptively inpaints missing depth values to render complete novel views.
arXiv Detail & Related papers (2022-04-22T03:17:35Z)
- Emotion Recognition on large video dataset based on Convolutional Feature Extractor and Recurrent Neural Network [0.2855485723554975]
Our model combines a convolutional neural network (CNN) with a recurrent neural network (RNN) to predict dimensional emotions on video data.
Experiments are performed on publicly available datasets including the largest modern Aff-Wild2 database.
arXiv Detail & Related papers (2020-06-19T14:54:13Z)
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.