Cross-Architecture Self-supervised Video Representation Learning
- URL: http://arxiv.org/abs/2205.13313v1
- Date: Thu, 26 May 2022 12:41:19 GMT
- Title: Cross-Architecture Self-supervised Video Representation Learning
- Authors: Sheng Guo, Zihua Xiong, Yujie Zhong, Limin Wang, Xiaobo Guo, Bing Han,
Weilin Huang
- Abstract summary: We present a new cross-architecture contrastive learning framework for self-supervised video representation learning.
We introduce a temporal self-supervised learning module that explicitly predicts the edit distance between two video sequences.
We evaluate our method on video retrieval and action recognition on the UCF101 and HMDB51 datasets.
- Score: 42.267775859095664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a new cross-architecture contrastive learning
(CACL) framework for self-supervised video representation learning. CACL
consists of a 3D CNN and a video transformer, used in parallel to generate
diverse positive pairs for contrastive learning. This allows the model to
learn strong representations from such diverse yet meaningful pairs.
Furthermore, we introduce a temporal self-supervised learning module that
explicitly predicts the edit distance between two video sequences in temporal
order. This enables the model to learn a rich temporal representation that
strongly complements the video-level representation learned by CACL. We
evaluate our method on video retrieval and action recognition on the UCF101
and HMDB51 datasets, where it achieves excellent performance, surpassing
state-of-the-art methods such as VideoMoCo and MoCo+BE by a large margin. The
code is made available at https://github.com/guoshengcv/CACL.
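To make the two ideas in the abstract concrete, the sketch below shows, in PyTorch-style Python, how a cross-architecture positive pair and an edit-distance supervision target could be wired up. This is a minimal illustration under stated assumptions, not the released CACL code: the backbone classes, their out_dim attribute, the projection heads, and the use of frame-index sequences as edit-distance inputs are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def info_nce(q, k, temperature=0.07):
    """Symmetric InfoNCE: q[i] and k[i] form a cross-architecture positive pair."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature                    # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


class CrossArchContrast(nn.Module):
    """Hypothetical wrapper: a 3D CNN and a video transformer encode two views in parallel."""

    def __init__(self, cnn3d, video_transformer, dim=128):
        super().__init__()
        self.cnn3d = cnn3d                    # placeholder 3D CNN encoder
        self.transformer = video_transformer  # placeholder video transformer encoder
        # `out_dim` on the backbones is an assumed attribute for this sketch.
        self.proj_c = nn.Linear(cnn3d.out_dim, dim)
        self.proj_t = nn.Linear(video_transformer.out_dim, dim)

    def forward(self, clip_a, clip_b):
        # Each architecture sees a differently augmented view of the same clip,
        # producing the "diverse yet meaningful" positive pairs described above.
        z_c = self.proj_c(self.cnn3d(clip_a))
        z_t = self.proj_t(self.transformer(clip_b))
        return info_nce(z_c, z_t)


def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two frame-index sequences, e.g. an original
    ordering and a shuffled one; a possible target for the temporal module."""
    m, n = len(seq_a), len(seq_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]
```

For instance, edit_distance([0, 1, 2, 3], [0, 2, 1, 3]) returns 2; in such a setup, that scalar (or a quantized version of it) would serve as the label the temporal prediction head is trained to output.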
Related papers
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning (EVL), a framework for directly training high-quality video recognition models on top of frozen CLIP features.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by recent progress in self-supervised learning, we present a novel sequence contrastive loss (SCL) applied to two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Cycle-Contrast for Self-Supervised Video Representation Learning [10.395615031496064]
We present Cycle-Contrastive Learning (CCL), a novel self-supervised method for learning video representations.
In our method, the frame and video representations are learned from a single network based on an R3D architecture.
We demonstrate that the video representation learned by CCL can be transferred well to downstream tasks of video understanding.
arXiv Detail & Related papers (2020-10-28T08:27:58Z)
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
We enlarge the set of negative samples by introducing intra-negative samples, which are generated by breaking the temporal relations within a video clip.
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
- Temporally Coherent Embeddings for Self-Supervised Video Representation Learning [2.216657815393579]
This paper presents TCE: Temporally Coherent Embeddings for self-supervised video representation learning.
The proposed method exploits the inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space.
With a simple but effective 2D-CNN backbone and only RGB stream inputs, TCE pre-trained representations outperform all previous self-supervised 2D-CNN and 3D-CNN representations pre-trained on UCF101.
arXiv Detail & Related papers (2020-03-21T12:25:50Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The cross-modal pair discrimination (CPD) framework captures the correlation between a video and its associated text.
We train CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.