Temporally Coherent Embeddings for Self-Supervised Video Representation
Learning
- URL: http://arxiv.org/abs/2004.02753v5
- Date: Tue, 17 Nov 2020 04:21:35 GMT
- Title: Temporally Coherent Embeddings for Self-Supervised Video Representation
Learning
- Authors: Joshua Knights, Ben Harwood, Daniel Ward, Anthony Vanderkop, Olivia
Mackenzie-Ross, Peyman Moghadam
- Abstract summary: This paper presents TCE: Temporally Coherent Embeddings for self-supervised video representation learning.
The proposed method exploits inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space.
With a simple but effective 2D-CNN backbone and only RGB stream inputs, TCE pre-trained representations outperform all previous self-supervised 2D-CNN and 3D-CNN pre-trained representations on UCF101.
- Score: 2.216657815393579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents TCE: Temporally Coherent Embeddings for self-supervised
video representation learning. The proposed method exploits inherent structure
of unlabeled video data to explicitly enforce temporal coherency in the
embedding space, rather than indirectly learning it through ranking or
predictive proxy tasks. In the same way that high-level visual information in
the world changes smoothly, we believe that the learned representations of
nearby frames should change just as smoothly. Using this
assumption, we train our TCE model to encode videos such that adjacent frames
exist close to each other and videos are separated from one another. Using TCE
we learn robust representations from large quantities of unlabeled video data.
We thoroughly analyse and evaluate our self-supervised learned TCE models on a
downstream task of video action recognition using multiple challenging
benchmarks (Kinetics400, UCF101, HMDB51). With a simple but effective 2D-CNN
backbone and only RGB stream inputs, TCE pre-trained representations outperform
all previous self-supervised 2D-CNN and 3D-CNN pre-trained representations on UCF101. The code
and pre-trained models for this paper can be downloaded at:
https://github.com/csiro-robotics/TCE
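The abstract does not spell out the training objective, so the snippet below is only a minimal PyTorch sketch of a temporal-coherence contrastive loss in the spirit described: embeddings of adjacent frames from the same video are pulled together, while frames from other videos in the batch serve as negatives. The function name, tensor shapes, and temperature value are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch (not the paper's code): an InfoNCE-style loss that pulls the
# embeddings of adjacent frames together and pushes frames from other videos
# away. Shapes, names, and the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F


def temporal_coherence_loss(anchor, positive, negatives, temperature=0.1):
    """anchor, positive: (B, D) embeddings of adjacent frames from the same video.
    negatives: (B, N, D) embeddings of frames drawn from other videos."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)      # (B, 1)
    neg_sim = torch.einsum("bd,bnd->bn", anchor, negatives)      # (B, N)

    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1 + N)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)                       # positive sits at index 0
```

Under these assumptions, such a loss would be applied during pre-training to frame embeddings produced by the 2D-CNN backbone.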
Related papers
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning, a framework for directly training high-quality video recognition models on top of frozen CLIP features.
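As a rough illustration of the frozen-backbone idea implied by the title, the sketch below keeps a pretrained per-frame encoder fixed and trains only a small temporal head and classifier. The decoder in the paper is more elaborate; `image_encoder`, `feat_dim`, the single transformer layer, and the mean pooling are assumptions for illustration.

```python
# Rough sketch of "frozen backbone + lightweight trainable head" video
# recognition. `image_encoder` is any pretrained per-frame encoder (e.g. the
# CLIP visual tower) and is assumed, not taken from the paper's code.
import torch
import torch.nn as nn


class FrozenBackboneVideoClassifier(nn.Module):
    def __init__(self, image_encoder, feat_dim, num_classes):
        super().__init__()
        self.backbone = image_encoder.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False                       # the backbone is never updated
        self.temporal_head = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True)  # the only trainable part
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames):                            # frames: (B, T, C, H, W)
        b, t = frames.shape[:2]
        with torch.no_grad():
            feats = self.backbone(frames.flatten(0, 1))   # per-frame features, (B*T, D)
        feats = self.temporal_head(feats.reshape(b, t, -1))  # temporal reasoning over T frames
        return self.classifier(feats.mean(dim=1))         # average-pool over time
```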
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- Cross-Architecture Self-supervised Video Representation Learning [42.267775859095664]
We present a new cross-architecture contrastive learning framework for self-supervised video representation learning.
We introduce a temporal self-supervised learning module able to predict an Edit distance explicitly between two video sequences.
We evaluate our method on the tasks of video retrieval and action recognition on UCF101 and HMDB51 datasets.
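How the two sequences are built is not described in the summary above; the sketch below simply computes a plain Levenshtein edit distance between two assumed frame-index sequences and normalises it into a regression target that a temporal head could be trained to predict.

```python
# Illustrative only: a plain Levenshtein edit distance between two sampled
# frame-index sequences, normalised into a regression target. The way the
# paper builds its two sequences is not described above and is assumed here.
def edit_distance(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1]


clip_a = [0, 4, 8, 12, 16, 20, 24, 28]  # frames sampled at stride 4 (assumed)
clip_b = [0, 2, 4, 6, 8, 10, 12, 14]    # frames sampled at stride 2 (assumed)
target = edit_distance(clip_a, clip_b) / max(len(clip_a), len(clip_b))
```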
arXiv Detail & Related papers (2022-05-26T12:41:19Z)
- A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task to well model both motion and appearance features.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
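A toy sketch of how such relative-speed labels might be produced is given below; the stride values, clip length, and three-way label scheme are assumptions for illustration rather than the paper's exact sampling scheme.

```python
# Toy sketch: sample two clips from the same video at (possibly) different
# strides and derive a relative-speed label. Strides, clip length, and the
# three-way labelling are assumptions for illustration.
import random


def sample_clip(num_frames, clip_len, stride):
    start = random.randint(0, num_frames - clip_len * stride)
    return [start + i * stride for i in range(clip_len)]


def relative_speed_pair(num_frames=300, clip_len=16):
    s1, s2 = random.choice([1, 2, 4]), random.choice([1, 2, 4])
    clip1 = sample_clip(num_frames, clip_len, s1)
    clip2 = sample_clip(num_frames, clip_len, s2)
    label = 0 if s1 < s2 else (1 if s1 == s2 else 2)  # slower / same / faster
    return clip1, clip2, label
```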
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
Because temporal information matters for video representation, we extend the negative samples by introducing intra-negative samples.
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
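One simple way to realise an intra-negative, sketched below, is to take a clip from the same video and deliberately break its temporal structure; both perturbation variants shown are assumptions about how such samples might be formed, not the paper's code.

```python
# Minimal sketch of building an "intra-negative": a clip from the same video
# whose temporal structure is deliberately broken. Both variants below are
# assumed ways such samples might be formed.
import torch


def make_intra_negative(clip, mode="shuffle"):
    """clip: (T, C, H, W) tensor of frames from one video."""
    t = clip.size(0)
    if mode == "shuffle":                              # scramble the frame order
        return clip[torch.randperm(t)]
    idx = torch.randint(0, t, (1,)).item()             # "repeat": freeze the motion
    return clip[idx:idx + 1].expand(t, -1, -1, -1)
```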
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top performing objectives in this class - instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The cross-modal pair (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate their effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)