Nearest-Neighbor Inter-Intra Contrastive Learning from Unlabeled Videos
- URL: http://arxiv.org/abs/2303.07317v1
- Date: Mon, 13 Mar 2023 17:38:58 GMT
- Title: Nearest-Neighbor Inter-Intra Contrastive Learning from Unlabeled Videos
- Authors: David Fan, Deyu Yang, Xinyu Li, Vimal Bhat, Rohith MV
- Abstract summary: State-of-the-art contrastive learning methods augment two clips from the same video as positives.
We leverage nearest-neighbor videos from the global space as additional positive pairs.
Our method, Inter-Intra Video Contrastive Learning (IIVCL), improves performance on a range of video tasks.
- Score: 8.486392464244267
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Contrastive learning has recently narrowed the gap between self-supervised
and supervised methods in the image and video domains. State-of-the-art video
contrastive learning methods such as CVRL and $\rho$-MoCo spatiotemporally
augment two clips from the same video as positives. By only sampling positive
clips locally from a single video, these methods neglect other semantically
related videos that can also be useful. To address this limitation, we leverage
nearest-neighbor videos from the global space as additional positive pairs,
thus improving positive key diversity and introducing a more relaxed notion of
similarity that extends beyond video and even class boundaries. Our method,
Inter-Intra Video Contrastive Learning (IIVCL), improves performance on a range
of video tasks.
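To make the core idea concrete, below is a minimal PyTorch sketch of combining a standard intra-video InfoNCE term with a nearest-neighbor inter-video term drawn from a global support bank. This is not the authors' released code; the function name, the support bank, the temperature value, and the equal weighting of the two terms are all illustrative assumptions.

```python
# Minimal sketch of nearest-neighbor inter-intra contrastive learning.
# Hypothetical names and loss weighting; not the authors' implementation.
import torch
import torch.nn.functional as F

def nn_inter_intra_loss(z_query, z_key, support_bank, tau=0.1):
    """z_query, z_key: (B, D) embeddings of two clips from the same B videos.
    support_bank: (M, D) embeddings of other videos (the "global space")."""
    z_query = F.normalize(z_query, dim=1)
    z_key = F.normalize(z_key, dim=1)
    bank = F.normalize(support_bank, dim=1)
    labels = torch.arange(z_query.size(0), device=z_query.device)

    # Intra-video term (CVRL / rho-MoCo style): the positive for clip 1
    # is clip 2 of the same video; other videos in the batch are negatives.
    loss_intra = F.cross_entropy(z_query @ z_key.t() / tau, labels)

    # Inter-video term: the positive is the nearest neighbor of the key in
    # the global bank, a relaxed similarity that crosses video boundaries.
    nn_idx = (z_key @ bank.t()).argmax(dim=1)           # (B,) NN indices
    loss_inter = F.cross_entropy(z_query @ bank.t() / tau, nn_idx)

    return loss_intra + loss_inter                      # weighting is a guess

# Toy usage with random embeddings.
B, D, M = 8, 128, 1024
loss = nn_inter_intra_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(M, D))
```

Treating the bank's nearest neighbor as the positive class over all bank entries is one simple way to realize the "relaxed similarity beyond video boundaries" described in the abstract; in practice the bank would typically be a stop-gradient queue of past key embeddings.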
Related papers
- Learning video retrieval models with relevance-aware online mining [16.548016892117083]
A typical approach consists in learning a joint text-video embedding space, where the similarity of a video and its associated caption is maximized.
This approach assumes that only the video-caption pairs in the dataset are valid, but other captions may also describe a video's visual content, so some valid positives are wrongly penalized as negatives.
We propose Relevance-Aware Negatives and Positives mining (RANP), which exploits the semantics of the negatives to improve their selection while also increasing the similarity of other valid positives.
arXiv Detail & Related papers (2022-03-16T15:23:55Z)
- Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning [50.544635516455116]
This paper focuses on designing video augmentation for self-supervised learning.
We first analyze the best strategy to mix videos to create a new augmented video sample.
We propose Cross-Modal Manifold Cutmix (CMMC) that inserts a video tesseract into another video tesseract in the feature space across two different modalities.
arXiv Detail & Related papers (2021-12-07T18:58:33Z)
- Video Contrastive Learning with Global Context [37.966950264445394]
We propose a new video-level contrastive learning method based on segments to formulate positive pairs.
Our formulation captures global context in a video and is thus robust to temporal content changes.
arXiv Detail & Related papers (2021-08-05T16:42:38Z)
- Broaden Your Views for Self-Supervised Video Learning [97.52216510672251]
We introduce BraVe, a self-supervised learning framework for video.
In BraVe, one of the views has access to a narrow temporal window of the video, while the other view has broad access to the video content.
We demonstrate that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks.
arXiv Detail & Related papers (2021-03-30T17:58:46Z)
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [98.41300980759577]
The canonical approach to video-and-language learning dictates that a neural model learn from offline-extracted dense video features.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
- TCLR: Temporal Contrastive Learning for Video Representation [49.6637562402604]
We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods.
With the commonly used 3D-ResNet-18 architecture, we achieve 82.4% (+5.1% increase over the previous best) top-1 accuracy on UCF101 and 52.9% (+5.4% increase) on HMDB51 action classification.
arXiv Detail & Related papers (2021-01-20T05:38:16Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
We extend the set of negative samples by introducing intra-negative samples, which are generated from the same video by breaking its temporal relations (see the sketch after this list).
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
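As a companion to the last entry above, here is a minimal PyTorch sketch of intra-negative samples: a temporally shuffled clip of the same video is treated as an extra negative, so the encoder cannot rely on static appearance alone. This is an assumption-laden illustration, not the paper's code; the encoder, function names, and logit layout are all hypothetical.

```python
# Minimal sketch of intra-negative contrastive learning; hypothetical names.
import torch
import torch.nn.functional as F

class TinyVideoEncoder(torch.nn.Module):
    """Stand-in encoder: Conv3d + global average pool + linear projection."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = torch.nn.Conv3d(3, 16, kernel_size=3, padding=1)
        self.head = torch.nn.Linear(16, dim)

    def forward(self, x):                        # x: (B, 3, T, H, W)
        return self.head(F.relu(self.conv(x)).mean(dim=(2, 3, 4)))

def intra_negative_loss(encoder, clip_a, clip_b, tau=0.1):
    """clip_a, clip_b: (B, 3, T, H, W) two augmented clips of the same videos."""
    B = clip_a.size(0)
    # Intra-negative: the same clip with its frame order shuffled, which
    # breaks temporal relations while keeping appearance statistics intact.
    clip_neg = clip_a[:, :, torch.randperm(clip_a.size(2))]

    z_a = F.normalize(encoder(clip_a), dim=1)
    z_b = F.normalize(encoder(clip_b), dim=1)
    z_neg = F.normalize(encoder(clip_neg), dim=1)

    pos = (z_a * z_b).sum(1, keepdim=True) / tau               # (B, 1) positive
    neg_intra = (z_a * z_neg).sum(1, keepdim=True) / tau       # (B, 1) shuffled self
    sim = z_a @ z_b.t() / tau                                  # (B, B)
    off_diag = ~torch.eye(B, dtype=torch.bool, device=sim.device)
    neg_inter = sim[off_diag].view(B, B - 1)                   # other videos

    logits = torch.cat([pos, neg_intra, neg_inter], dim=1)     # positive at index 0
    targets = torch.zeros(B, dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy usage on random clips.
enc = TinyVideoEncoder()
a, b = torch.randn(4, 3, 8, 32, 32), torch.randn(4, 3, 8, 32, 32)
print(intra_negative_loss(enc, a, b).item())
```

Placing the shuffled clip among the negatives pushes the representation to encode temporal order, which is the contrast-within-a-video idea that the IIVCL paper above generalizes across videos via nearest neighbors.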