Unsupervised Video Representation Learning by Bidirectional Feature Prediction
- URL: http://arxiv.org/abs/2011.06037v1
- Date: Wed, 11 Nov 2020 19:42:31 GMT
- Title: Unsupervised Video Representation Learning by Bidirectional Feature Prediction
- Authors: Nadine Behrmann and Juergen Gall and Mehdi Noroozi
- Abstract summary: This paper introduces a novel method for self-supervised video representation learning via feature prediction.
We argue that a supervisory signal arising from unobserved past frames is complementary to one that originates from future frames.
We empirically show that utilizing both signals enriches the learned representations for the downstream task of action recognition.
- Score: 16.074111448606512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel method for self-supervised video representation
learning via feature prediction. In contrast to previous methods that focus on
future feature prediction, we argue that a supervisory signal arising from
unobserved past frames is complementary to one that originates from future
frames. The rationale behind our method is to encourage the network to explore
the temporal structure of videos by distinguishing between future and past
given present observations. We train our model in a contrastive learning
framework, where joint encoding of future and past provides a comprehensive
set of temporal hard negatives via swapping. We empirically show that
utilizing both signals enriches the learned representations for the downstream
task of action recognition, and that joint prediction outperforms independent
prediction of future and past.
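A minimal sketch of the bidirectional contrastive objective described above, assuming hypothetical encoder outputs: z_pred_fut and z_pred_past stand for predictions made from the present clip, z_fut and z_past for encoded future and past clips, and the temperature tau is illustrative. This is not the authors' released code; it only illustrates how swapping future and past features yields temporal hard negatives.

```python
import torch
import torch.nn.functional as F

def bidirectional_nce(z_pred_fut, z_pred_past, z_fut, z_past, tau=0.07):
    """InfoNCE with swapped future/past features as temporal hard negatives.

    All inputs are (B, D) batches of features; names are illustrative.
    """
    # L2-normalize so dot products are cosine similarities.
    z_pred_fut, z_pred_past, z_fut, z_past = [
        F.normalize(z, dim=1) for z in (z_pred_fut, z_pred_past, z_fut, z_past)
    ]
    B = z_fut.size(0)
    labels = torch.arange(B, device=z_fut.device)

    # Candidates for the future prediction: all future clips (positive on the
    # diagonal) plus all past clips, which serve as swapped hard negatives;
    # symmetrically for the past prediction.
    cand_fut = torch.cat([z_fut, z_past], dim=0)     # (2B, D)
    cand_past = torch.cat([z_past, z_fut], dim=0)    # (2B, D)

    logits_fut = z_pred_fut @ cand_fut.t() / tau     # (B, 2B)
    logits_past = z_pred_past @ cand_past.t() / tau  # (B, 2B)

    return F.cross_entropy(logits_fut, labels) + F.cross_entropy(logits_past, labels)
```

Because the swapped negatives come from the same video as the positive, appearance cues cannot tell them apart; the prediction can only match the correct target by encoding temporal direction, which is exactly the structure the method aims to learn.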
Related papers
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z)
- Visual Representation Learning with Stochastic Frame Prediction [90.99577838303297]
This paper revisits the idea of video generation that learns to capture uncertainty in frame prediction.
We design a framework that trains a frame prediction model to learn temporal information between frames.
We find this architecture allows for combining both objectives in a synergistic and compute-efficient manner.
arXiv Detail & Related papers (2024-06-11T16:05:15Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Inductive Attention for Video Action Anticipation [16.240254363118016]
We propose an inductive attention model, dubbed IAM, which leverages the current prior predictions as queries to infer the future action.
Our method consistently outperforms the state-of-the-art anticipation models on multiple large-scale egocentric video datasets.
arXiv Detail & Related papers (2022-12-17T09:51:17Z)
- Unified Recurrence Modeling for Video Action Anticipation [16.240254363118016]
We propose unified recurrence modeling for video action anticipation via a message-passing framework.
Our proposed method outperforms previous works on the large-scale EPIC-Kitchens dataset.
arXiv Detail & Related papers (2022-06-02T12:16:44Z)
- Stochastic Coherence Over Attention Trajectory For Continuous Learning In Video Streams [64.82800502603138]
This paper proposes a novel neural-network-based approach to progressively and autonomously develop pixel-wise representations in a video stream.
The proposed method is based on a human-like attention mechanism that allows the agent to learn by observing what is moving in the attended locations.
Our experiments leverage 3D virtual environments and they show that the proposed agents can learn to distinguish objects just by observing the video stream.
arXiv Detail & Related papers (2022-04-26T09:52:31Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively, producing a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Fourier-based Video Prediction through Relational Object Motion [28.502280038100167]
Deep recurrent architectures have been applied to the task of video prediction.
Here, we explore a different direction, applying frequency-domain approaches to video prediction (a generic sketch of the underlying Fourier-domain idea appears after this list).
The resulting predictions are consistent with the observed dynamics in a scene and do not suffer from blur.
arXiv Detail & Related papers (2021-10-12T10:43:05Z)
- Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
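The Fourier-based entry above gives no implementation detail, but the general frequency-domain idea it builds on is the Fourier shift theorem: a spatial translation corresponds to a linear phase ramp in the frequency domain, so translational motion can be estimated by phase correlation and re-applied to extrapolate a frame. The sketch below illustrates only that generic principle (it is not the paper's relational object-motion model); the single-channel frames and the constant-velocity assumption are illustrative.

```python
import numpy as np

def estimate_shift(prev, cur):
    """Estimate the (dy, dx) translation from prev to cur via phase correlation."""
    R = np.fft.fft2(cur) * np.conj(np.fft.fft2(prev))
    R /= np.abs(R) + 1e-12                      # keep phase, discard magnitude
    corr = np.abs(np.fft.ifft2(R))              # peak sits at the displacement
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    H, W = prev.shape
    # Map wrapped peak positions to signed displacements.
    if dy > H // 2:
        dy -= H
    if dx > W // 2:
        dx -= W
    return dy, dx

def fourier_shift(frame, dy, dx):
    """Translate a frame by (dy, dx) pixels using the Fourier shift theorem."""
    H, W = frame.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    phase = np.exp(-2j * np.pi * (fy * dy + fx * dx))
    return np.fft.ifft2(np.fft.fft2(frame) * phase).real

def predict_next(prev, cur):
    """Constant-velocity prediction: re-apply the last estimated motion."""
    dy, dx = estimate_shift(prev, cur)
    return fourier_shift(cur, dy, dx)
```

Because the prediction is a pure phase manipulation, it re-renders the observed content rather than regressing pixel values, which is one reason frequency-domain predictions can avoid the blur typical of L2-trained frame regressors.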
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.