Temporal Contrastive Learning with Curriculum
- URL: http://arxiv.org/abs/2209.00760v1
- Date: Fri, 2 Sep 2022 00:12:05 GMT
- Title: Temporal Contrastive Learning with Curriculum
- Authors: Shuvendu Roy, Ali Etemad
- Abstract summary: ConCur is a contrastive video representation learning method that uses curriculum learning to impose a dynamic sampling strategy.
We conduct experiments on two popular action recognition datasets, UCF101 and HMDB51, on which our proposed method achieves state-of-the-art performance.
- Score: 19.442685015494316
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present ConCur, a contrastive video representation learning method that
uses curriculum learning to impose a dynamic sampling strategy in contrastive
training. More specifically, ConCur starts the contrastive training with easy
positive samples (temporally close and semantically similar clips), and as the
training progresses, it increases the temporal span, effectively sampling hard
positives (temporally distant and semantically dissimilar clips). To learn better
context-aware representations, we also propose an auxiliary task of predicting
the temporal distance between a positive pair of clips. We conduct extensive
experiments on two popular action recognition datasets, UCF101 and HMDB51, on
which our proposed method achieves state-of-the-art performance on two
benchmark tasks of video action recognition and video retrieval. We explore the
impact of encoder backbones and pre-training strategies by using R(2+1)D and
C3D encoders and pre-training on Kinetics-400 and Kinetics-200 datasets.
Moreover, a detailed ablation study shows the effectiveness of each of the
components of our proposed method.
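To make the sampling curriculum concrete, the sketch below is a minimal Python illustration based only on the abstract (not the authors' released code): it widens the allowed temporal span between an anchor clip and its positive as training progresses, and returns the normalized temporal distance that the auxiliary task would predict. The function and parameter names (curriculum_positive_pair, max_span_frames, clip_len) are assumptions made for illustration.
```python
# Illustrative sketch of curriculum-based positive sampling (assumed names,
# not the authors' implementation).
import random

def curriculum_positive_pair(num_frames: int, clip_len: int,
                             epoch: int, total_epochs: int,
                             max_span_frames: int) -> tuple[int, int, float]:
    """Return start indices for an anchor/positive clip pair and their
    normalized temporal distance (the assumed auxiliary prediction target).

    Early in training the allowed span is small, so positives are temporally
    close (easy); as training progresses the span widens, yielding temporally
    distant (hard) positives.
    """
    # Curriculum schedule: linearly grow the allowed span with training progress.
    progress = min(1.0, (epoch + 1) / total_epochs)
    allowed_span = max(clip_len, int(progress * max_span_frames))

    # Sample the anchor clip anywhere it fits in the video.
    anchor_start = random.randint(0, num_frames - clip_len)

    # Sample the positive clip within the currently allowed span of the anchor.
    lo = max(0, anchor_start - allowed_span)
    hi = min(num_frames - clip_len, anchor_start + allowed_span)
    positive_start = random.randint(lo, hi)

    # Normalized temporal distance between the two clips.
    temporal_distance = abs(positive_start - anchor_start) / max(1, num_frames - clip_len)
    return anchor_start, positive_start, temporal_distance

if __name__ == "__main__":
    # Example: early vs. late training on a 300-frame video with 16-frame clips.
    for epoch in (0, 99):
        a, p, d = curriculum_positive_pair(300, 16, epoch, 100, max_span_frames=240)
        print(f"epoch {epoch}: anchor={a}, positive={p}, distance={d:.2f}")
```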
Related papers
- Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning [15.998149438353133]
We propose a two-stage retrieval architecture for text-to-video retrieval.
In training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning.
In retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations.
arXiv Detail & Related papers (2024-01-01T08:54:18Z)
- Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning [14.292812802621707]
Contrastive Language-Image Pre-training (CLIP) has recently shown remarkable generalization in "zero-shot" settings.
We explore the adaptation of CLIP to achieve a more efficient and generalized action recognition method.
Our method outperforms most existing state-of-the-art methods by a significant margin in "few-shot" and "zero-shot" settings.
arXiv Detail & Related papers (2023-08-09T09:33:45Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of learning robust feature representations, aiming to generalize well across multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets: Kinetics-400, Kinetics-700, Moments-in-Time, ActivityNet and Something-Something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Self-supervised Co-training for Video Representation Learning [103.69904379356413]
We investigate the benefit of adding semantic-class positives to instance-based InfoNCE (Info Noise Contrastive Estimation) training.
We propose a novel self-supervised co-training scheme to improve the popular InfoNCE loss.
We evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval.
arXiv Detail & Related papers (2020-10-19T17:59:01Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The Cross-modal Pair Discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.