Visual Representation Learning with Stochastic Frame Prediction
- URL: http://arxiv.org/abs/2406.07398v2
- Date: Thu, 8 Aug 2024 19:48:10 GMT
- Title: Visual Representation Learning with Stochastic Frame Prediction
- Authors: Huiwon Jang, Dongyoung Kim, Junsu Kim, Jinwoo Shin, Pieter Abbeel, Younggyo Seo
- Abstract summary: This paper revisits the idea of stochastic video generation that learns to capture uncertainty in frame prediction.
We design a framework that trains a stochastic frame prediction model to learn temporal information between frames, alongside an auxiliary masked image modeling objective with a shared decoder.
We find this architecture allows for combining both objectives in a synergistic and compute-efficient manner.
- Score: 90.99577838303297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning of image representations by predicting future frames is a promising direction but still remains a challenge. This is because of the under-determined nature of frame prediction; multiple potential futures can arise from a single current frame. To tackle this challenge, in this paper, we revisit the idea of stochastic video generation that learns to capture uncertainty in frame prediction and explore its effectiveness for representation learning. Specifically, we design a framework that trains a stochastic frame prediction model to learn temporal information between frames. Moreover, to learn dense information within each frame, we introduce an auxiliary masked image modeling objective along with a shared decoder architecture. We find this architecture allows for combining both objectives in a synergistic and compute-efficient manner. We demonstrate the effectiveness of our framework on a variety of tasks from video label propagation and vision-based robot learning domains, such as video segmentation, pose tracking, vision-based robotic locomotion, and manipulation tasks. Code is available on the project webpage: https://sites.google.com/view/2024rsp.
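To make the two objectives concrete, here is a minimal sketch, assuming MLP stand-ins for the paper's ViT-style encoder and shared decoder and crude pixel-level masking; it illustrates the idea, not the authors' implementation from the project webpage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticFramePredictor(nn.Module):
    """Toy stand-in: MLPs replace the ViT encoder/decoder of the paper."""
    def __init__(self, dim=256, z_dim=32, pix=64 * 64 * 3):
        super().__init__()
        self.z_dim = z_dim
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(pix, dim))
        self.posterior = nn.Linear(2 * dim, 2 * z_dim)  # q(z | x_t, x_{t+1})
        self.prior = nn.Linear(dim, 2 * z_dim)          # learned prior p(z | x_t)
        self.decoder = nn.Linear(dim + z_dim, pix)      # shared by both objectives

    def prediction_loss(self, x_t, x_next):
        h_t, h_next = self.encoder(x_t), self.encoder(x_next)
        mu_q, logvar_q = self.posterior(torch.cat([h_t, h_next], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior(h_t).chunk(2, -1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()  # reparameterize
        recon = F.mse_loss(self.decoder(torch.cat([h_t, z], -1)), x_next.flatten(1))
        # KL(q || p): the stochastic latent absorbs the uncertainty over futures
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1).sum(-1).mean()
        return recon + 1e-4 * kl

    def mim_loss(self, x, mask_ratio=0.75):
        # Auxiliary masked image modeling through the *shared* decoder
        # (crude pixel-level masking here; the paper masks patch tokens).
        masked = x * (torch.rand_like(x) > mask_ratio).float()
        h = self.encoder(masked)
        z = torch.zeros(x.size(0), self.z_dim, device=x.device)
        return F.mse_loss(self.decoder(torch.cat([h, z], -1)), x.flatten(1))

model = StochasticFramePredictor()
x_t, x_next = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
(model.prediction_loss(x_t, x_next) + model.mim_loss(x_next)).backward()
```

Because the prior p(z | x_t) is learned, the model can represent multiple futures by sampling different z at test time.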
Related papers
- Video In-context Learning [46.40277880351059]
In this paper, we study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences.
To achieve this, we provide a clear definition of the task, and train an autoregressive Transformer on video datasets.
We design various evaluation metrics, including both objective and subjective measures, to demonstrate the visual quality and semantic accuracy of generation results.
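As a schematic of how such an autoregressive Transformer can be prompted with an in-context clip and sampled forward (the video tokenizer is elided and all sizes are my assumptions, not the paper's):

```python
import torch
import torch.nn as nn

vocab, dim = 1024, 256
embed = nn.Embedding(vocab, dim)
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
to_logits = nn.Linear(dim, vocab)

tokens = torch.randint(0, vocab, (1, 64))  # tokens of the in-context (prompt) clip
for _ in range(16):                        # autoregressively sample a continuation
    causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
    h = decoder(embed(tokens), mask=causal)
    nxt = to_logits(h[:, -1]).softmax(-1).multinomial(1)
    tokens = torch.cat([tokens, nxt], dim=1)
```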
arXiv Detail & Related papers (2024-07-10T04:27:06Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Neural Rendering of Humans in Novel View and Pose from Monocular Video [68.37767099240236]
We introduce a new method that generates photo-realistic humans under novel views and poses given a monocular video as input.
Our method significantly outperforms existing approaches under unseen poses and novel views given monocular videos as input.
arXiv Detail & Related papers (2022-04-04T03:09:20Z)
- Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework that integrates two complementary attributes, context and motion, to predict complex pixel dynamics through deep networks.
We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames.
We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory.
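One plausible reading of the adaptive-kernel idea is per-pixel dynamic filtering; the sketch below (my illustration, with the motion memory omitted) predicts a k x k kernel at every location and applies it to the input frame:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFilterPredictor(nn.Module):
    def __init__(self, channels=3, k=3):
        super().__init__()
        self.k = k
        # Hypothetical kernel head; the paper derives kernels from a memory of object motion.
        self.kernel_net = nn.Conv2d(channels, k * k, kernel_size=3, padding=1)

    def forward(self, frame):
        B, C, H, W = frame.shape
        kernels = F.softmax(self.kernel_net(frame), dim=1)       # one k*k kernel per pixel
        patches = F.unfold(frame, self.k, padding=self.k // 2)   # (B, C*k*k, H*W)
        patches = patches.view(B, C, self.k * self.k, H * W)
        kernels = kernels.view(B, 1, self.k * self.k, H * W)
        return (patches * kernels).sum(dim=2).view(B, C, H, W)   # per-pixel weighted sum

pred = AdaptiveFilterPredictor()(torch.randn(1, 3, 32, 32))      # predicted next frame
```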
arXiv Detail & Related papers (2021-10-22T04:35:58Z)
- Efficient training for future video generation based on hierarchical disentangled representation of latent variables [66.94698064734372]
We propose a novel method for generating future prediction videos with less memory usage than conventional methods.
We achieve high efficiency by training our method in two stages: (1) image reconstruction to encode video frames into latent variables, and (2) latent variable prediction to generate the future sequence.
Our experiments show that the proposed method can efficiently generate future prediction videos, even for complex datasets that cannot be handled by previous methods.
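A toy rendition of the two-stage recipe (module choices are my assumptions): train an autoencoder on single frames, then freeze it and fit a predictor entirely in the low-dimensional latent space, which is where the memory savings come from:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 128))   # frame -> latent
dec = nn.Linear(128, 64 * 64 * 3)                                # latent -> frame
predictor = nn.GRU(input_size=128, hidden_size=128, batch_first=True)
frames = torch.randn(4, 10, 3, 64, 64)                           # (batch, time, C, H, W)

# Stage 1: image reconstruction, every frame treated independently
opt1 = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
x = frames.flatten(0, 1)
F.mse_loss(dec(enc(x)), x.flatten(1)).backward()
opt1.step()

# Stage 2: latent variable prediction with the autoencoder frozen --
# the sequence model never touches pixel space
for p in list(enc.parameters()) + list(dec.parameters()):
    p.requires_grad_(False)
opt2 = torch.optim.Adam(predictor.parameters())
with torch.no_grad():
    z = enc(frames.flatten(0, 1)).view(4, 10, 128)
z_pred, _ = predictor(z[:, :-1])                                 # next-latent prediction
F.mse_loss(z_pred, z[:, 1:]).backward()
opt2.step()
```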
arXiv Detail & Related papers (2021-06-07T10:43:23Z)
- Local Frequency Domain Transformer Networks for Video Prediction [24.126513851779936]
Video prediction is of interest not only for anticipating visual changes in the real world but, above all, as an unsupervised learning rule.
This paper proposes a fully differentiable building block that can serve these purposes separately while maintaining interpretability.
arXiv Detail & Related papers (2021-05-10T19:48:42Z)
- Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
- Mutual Information Based Method for Unsupervised Disentanglement of Video Representation [0.0]
Video prediction models have found prospective applications in maneuver planning, health care, autonomous navigation, and simulation.
One of the major challenges in future frame generation stems from the high-dimensional nature of visual data.
We propose a Mutual Information Predictive Auto-Encoder framework that reduces the burden of predicting high-dimensional video frames by disentangling the video representation.
arXiv Detail & Related papers (2020-11-17T13:16:07Z)
- Unsupervised Video Representation Learning by Bidirectional Feature Prediction [16.074111448606512]
This paper introduces a novel method for self-supervised video representation learning via feature prediction.
We argue that a supervisory signal arising from unobserved past frames is complementary to one that originates from future frames.
We empirically show that utilizing both signals enriches the learned representations for the downstream task of action recognition.
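A minimal regression-style rendition of this bidirectional signal (a simplification of mine; the paper's actual objective may differ): predict features of both the past and the future clip from the observed middle clip:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(4 * 3 * 32 * 32, 256))  # clip encoder stand-in
head_fwd = nn.Linear(256, 256)   # predicts features of the future clip
head_bwd = nn.Linear(256, 256)   # predicts features of the past clip

past, mid, future = (torch.randn(8, 4, 3, 32, 32) for _ in range(3))
h_mid = encoder(mid)
with torch.no_grad():            # encoded targets for feature prediction
    t_fut, t_past = encoder(future), encoder(past)
loss = F.mse_loss(head_fwd(h_mid), t_fut) + F.mse_loss(head_bwd(h_mid), t_past)
loss.backward()
```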
arXiv Detail & Related papers (2020-11-11T19:42:31Z)
- Motion Segmentation using Frequency Domain Transformer Networks [29.998917158604694]
We propose a novel end-to-end learnable architecture that predicts the next frame by modeling foreground and background separately.
Our approach can outperform some widely used video prediction methods like Video Ladder Network and Predictive Gated Pyramids on synthetic data.
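The separate foreground/background modeling can be caricatured in the pixel domain (a toy sketch, not the paper's frequency-domain formulation):

```python
import torch
import torch.nn as nn

class FgBgPredictor(nn.Module):
    def __init__(self, c=3):
        super().__init__()
        self.fg = nn.Conv2d(c, c, 3, padding=1)    # foreground motion model (stand-in)
        self.bg = nn.Conv2d(c, c, 3, padding=1)    # background model (stand-in)
        self.mask = nn.Conv2d(c, 1, 3, padding=1)  # soft foreground/background mask

    def forward(self, frame):
        m = torch.sigmoid(self.mask(frame))
        return m * self.fg(frame) + (1 - m) * self.bg(frame)  # composite next frame

next_frame = FgBgPredictor()(torch.randn(1, 3, 32, 32))
```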
arXiv Detail & Related papers (2020-04-18T15:05:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.