Contrastive Learning of Image Representations with Cross-Video
Cycle-Consistency
- URL: http://arxiv.org/abs/2105.06463v1
- Date: Thu, 13 May 2021 17:59:11 GMT
- Title: Contrastive Learning of Image Representations with Cross-Video
Cycle-Consistency
- Authors: Haiping Wu, Xiaolong Wang
- Abstract summary: Cross-video relation has barely been explored for visual representation learning.
We propose a novel contrastive learning method which explores the cross-video relation by using cycle-consistency for general image representation learning.
We show significant improvement over state-of-the-art contrastive learning methods.
- Score: 13.19476138523546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have advanced the performance of self-supervised representation
learning by a large margin. At the core of these methods is intra-image
invariance learning: two different transformations of one image instance are
considered a positive sample pair, and various tasks are designed to learn
invariant representations by comparing the pair. Analogously, for video data,
representations of frames from the same video are trained to be closer than
frames from other videos, i.e. intra-video invariance. However, cross-video
relation has barely been explored for visual representation learning. Unlike
intra-video invariance, ground-truth labels of cross-video relations are usually
unavailable without human labor. In this paper, we propose a novel contrastive
learning method which explores the cross-video relation by using
cycle-consistency for general image representation learning. This allows us to
collect positive sample pairs across different video instances, which we
hypothesize will lead to higher-level semantics. We validate our method by
transferring our image representation to multiple downstream tasks including
visual object tracking, image classification, and action recognition. We show
significant improvement over state-of-the-art contrastive learning methods.
Project page is available at https://happywu.github.io/cycle_contrast_video.
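The cycle-consistency idea can be made concrete as follows: starting from a frame in one video, walk to its nearest neighbour in another video and then back; if the walk returns to the starting frame, the two frames are treated as a cross-video positive pair. Below is a minimal PyTorch sketch of this reading, not the authors' implementation; all names are illustrative and the frame embeddings are assumed to be L2-normalized.

    import torch

    def cycle_consistent_positives(feats_a, feats_b):
        # feats_a: (Na, D), feats_b: (Nb, D) L2-normalized frame embeddings
        sim_ab = feats_a @ feats_b.t()         # (Na, Nb) cosine similarities
        nn_in_b = sim_ab.argmax(dim=1)         # forward step: nearest frame in video B
        sim_ba = feats_b @ feats_a.t()         # (Nb, Na)
        nn_back_in_a = sim_ba.argmax(dim=1)    # backward step: nearest frame in video A
        idx = torch.arange(feats_a.size(0), device=feats_a.device)
        consistent = nn_back_in_a[nn_in_b] == idx   # cycle A -> B -> A returns home
        # frames of A and their partners in B that close the cycle form positive pairs
        return idx[consistent], nn_in_b[consistent]

The mined pairs can then replace, or supplement, the usual intra-instance positives in a standard contrastive loss.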
Related papers
- JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion
Retargeting [53.28477676794658]
The task of unsupervised motion retargeting in videos has seen substantial advancements through the use of deep neural networks.
We introduce JOKR - a JOint Keypoint Representation that handles both the source and target videos, without requiring any object prior or data collection.
We evaluate our method both qualitatively and quantitatively, and demonstrate that our method handles various cross-domain scenarios, such as different animals, different flowers, and humans.
arXiv Detail & Related papers (2021-06-17T17:32:32Z)
- ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives (see the sketch below).
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentation, and in doing so also achieve state-of-the-art performance on a number of video benchmarks.
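The common paradigm described in this summary is typically implemented with an InfoNCE-style loss, where each instance's second view is its positive and all other instances in the batch act as negatives. The following is a generic sketch of that baseline, assuming two batches of view embeddings; it illustrates the paradigm, not this paper's augmentation-aware variant.

    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, temperature=0.1):
        # z1, z2: (N, D) embeddings of two augmented views of the same N instances
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature   # (N, N); diagonal entries are positives
        targets = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, targets)  # attract paired views, repel the rest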
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Rethinking Self-supervised Correspondence Learning: A Video Frame-level
Similarity Perspective [13.90183404059193]
We propose to learn correspondence using Video Frame-level Similarity (VFS) learning.
Our work is inspired by the recent success in image-level contrastive learning and similarity learning for visual recognition.
Our experiments show the surprising result that VFS surpasses state-of-the-art self-supervised approaches for both OTB visual object tracking and DAVIS video object segmentation.
arXiv Detail & Related papers (2021-03-31T17:56:35Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Self-supervised Video Representation Learning Using Inter-intra
Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
We extend the negative samples by introducing intra-negative samples, which are drawn from the same video as the anchor (sketched below).
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
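Intra-negative samples are commonly built by breaking the temporal order of the anchor video itself, so that the model must attend to temporal structure rather than appearance alone. A minimal sketch under that assumption (illustrative, not the paper's exact pipeline):

    import torch

    def make_intra_negative(clip):
        # clip: (T, C, H, W) video clip; shuffling the frames breaks temporal
        # order, yielding a negative drawn from the same video as the anchor
        perm = torch.randperm(clip.size(0))
        return clip[perm]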
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
- Watching the World Go By: Representation Learning from Unlabeled Videos [78.22211989028585]
Recent single image unsupervised representation learning techniques show remarkable success on a variety of tasks.
In this paper, we argue that videos offer natural variation, a form of data augmentation, for free.
We propose Video Noise Contrastive Estimation, a method for using unlabeled video to learn strong, transferable single image representations.
arXiv Detail & Related papers (2020-03-18T00:07:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.