Video Contrastive Learning with Global Context
- URL: http://arxiv.org/abs/2108.02722v1
- Date: Thu, 5 Aug 2021 16:42:38 GMT
- Title: Video Contrastive Learning with Global Context
- Authors: Haofei Kuang, Yi Zhu, Zhi Zhang, Xinyu Li, Joseph Tighe, Sören Schwertfeger, Cyrill Stachniss, Mu Li
- Abstract summary: We propose a new video-level contrastive learning method based on segments to formulate positive pairs.
Our formulation is able to capture global context in a video and is thus robust to temporal content change.
- Score: 37.966950264445394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive learning has revolutionized the field of self-supervised
image representation learning and has recently been adapted to the video domain.
One of the greatest advantages of contrastive learning is that it allows us to
flexibly define powerful loss objectives, as long as we can find a reasonable way
to formulate positive and negative samples to contrast. However, existing
approaches rely heavily on short-range spatiotemporal salience to form clip-level
contrastive signals, which limits their use of global context. In this
paper, we propose a new video-level contrastive learning method based on
segments to formulate positive pairs. Our formulation captures global
context in a video and is thus robust to temporal content change. We also incorporate
a temporal order regularization term to enforce the inherent sequential
structure of videos. Extensive experiments show that our video-level
contrastive learning framework (VCLR) outperforms previous state-of-the-art
methods on five video datasets for downstream action classification,
action localization and video retrieval. Code is available at
https://github.com/amazon-research/video-contrastive-learning.
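The abstract names two ingredients: video-level positive pairs built from temporal segments, and a temporal order regularizer. The PyTorch sketch below illustrates both in generic form; it is not the authors' implementation (see the linked repository for that), and `order_head`, the segment count, and the permutation-classification scheme are assumptions made for this sketch.

```python
import itertools

import torch
import torch.nn.functional as F

def sample_clip_per_segment(video, num_segments=4, clip_len=8):
    """Draw one short clip from each equal-length temporal segment, so a
    view summarizes the whole video rather than one local window.
    video: (T, C, H, W) frames; assumes T >= num_segments * clip_len."""
    seg_len = video.shape[0] // num_segments
    clips = []
    for s in range(num_segments):
        lo = s * seg_len
        start = lo + torch.randint(0, max(seg_len - clip_len, 1), (1,)).item()
        clips.append(video[start:start + clip_len])
    return torch.stack(clips)  # (num_segments, clip_len, C, H, W)

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE: row i of z1 and z2 embed two views of video i;
    every other row in the batch serves as a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, labels)

def order_loss(order_head, seg_feats):
    """Temporal-order regularizer: permute per-segment features and ask a
    small classifier (order_head, hypothetical here) to recover which
    permutation was applied. seg_feats: (num_segments, D)."""
    perms = list(itertools.permutations(range(seg_feats.shape[0])))
    target = torch.randint(len(perms), (1,)).item()
    shuffled = seg_feats[list(perms[target])].reshape(-1)
    logits = order_head(shuffled)  # scores over all permutations
    return F.cross_entropy(logits.unsqueeze(0),
                           torch.tensor([target], device=logits.device))
```

In training, `info_nce` would be applied to projections of two independently segment-sampled views of each video, with `order_loss` added as a weighted regularization term.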
Related papers
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches rely heavily on a large number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- Nearest-Neighbor Inter-Intra Contrastive Learning from Unlabeled Videos [8.486392464244267]
State-of-the-art contrastive learning methods augment two clips from the same video as positives.
We leverage nearest-neighbor videos from the global space as additional positive pairs.
Our method, Inter-Intra Video Contrastive Learning (IIVCL), improves performance on a range of video tasks (a sketch follows this entry).
arXiv Detail & Related papers (2023-03-13T17:38:58Z)
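The nearest-neighbor positives of the entry above can be sketched by keeping a bank of embeddings from other videos and treating each query's closest bank entry as an extra positive. The memory bank and the way the two loss terms combine are assumptions of this illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE with an explicit per-row positive and a shared negative set."""
    pos = (anchor * positive).sum(dim=1, keepdim=True)  # (B, 1)
    neg = anchor @ negatives.t()                        # (B, N)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(anchor.shape[0], dtype=torch.long,
                         device=anchor.device)          # positive sits at column 0
    return F.cross_entropy(logits, labels)

def inter_intra_loss(z_q, z_k, bank, temperature=0.1):
    """z_q, z_k: (B, D) embeddings of two clips of the same video (the usual
    intra-video pair). bank: (N, D) embeddings of other videos, e.g. a
    memory queue (assumed here)."""
    z_q, z_k = F.normalize(z_q, dim=1), F.normalize(z_k, dim=1)
    bank = F.normalize(bank, dim=1)
    # Inter-video positive: each query's nearest neighbor in the global bank.
    z_nn = bank[(z_q @ bank.t()).argmax(dim=1)]
    # Same-video term plus nearest-neighbor term.
    return nce(z_q, z_k, bank, temperature) + nce(z_q, z_nn, bank, temperature)
```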
- TempCLR: Temporal Alignment Representation with Contrastive Learning [35.12182087403215]
We propose a contrastive learning framework, TempCLR, to explicitly compare the full video and the paragraph.
In addition to pre-training on videos and paragraphs, our approach also generalizes to matching between video instances.
arXiv Detail & Related papers (2022-12-28T08:10:31Z)
- Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state-of-the-art by a large margin on downstream fine-grained action classification, while also offering faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z)
- Learning Transferable Spatiotemporal Representations from Natural Script Knowledge [65.40899722211726]
We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts ASR transcripts by attending to learned video representations.
This enables our model to contextualize what is happening much as humans do, and to apply seamlessly to large-scale uncurated real-world video data.
arXiv Detail & Related papers (2022-09-30T07:39:48Z)
- Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency [60.756222188023635]
We propose to learn representations by leveraging the more abundant information in untrimmed videos.
HiCo can generate stronger representations on untrimmed videos, and it also improves representation quality when applied to trimmed videos.
arXiv Detail & Related papers (2022-04-06T18:04:54Z)
- Controllable Augmentations for Video Representation Learning [34.79719112810065]
We propose a framework that jointly utilizes local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations.
Our framework outperforms prior methods on three video benchmarks for action recognition and video retrieval, capturing more accurate temporal dynamics.
arXiv Detail & Related papers (2022-03-30T19:34:32Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning that exploits complementary information across views (a sketch follows this entry).
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
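A common way to realize cooperation across views, sketched below for the CoCon entry above, is to let similarities in one view (say, optical flow) nominate positives for another (RGB). The pairing of RGB and flow and the mining rule are illustrative assumptions, not CoCon's exact formulation.

```python
import torch
import torch.nn.functional as F

def flow_mined_neighbors(z_flow, k=1):
    """Instances that look alike in the flow view nominate positives for the
    RGB view. z_flow: (B, D) flow embeddings; typically run without grad."""
    z = F.normalize(z_flow, dim=1)
    sim = z @ z.t()
    eye = torch.eye(z.shape[0], dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float('-inf'))  # exclude trivial self-matches
    return sim.topk(k, dim=1).indices          # (B, k) neighbor indices

def cooperative_loss(z_rgb, nn_idx, temperature=0.1):
    """Pull each RGB embedding toward its flow-mined neighbors."""
    z = F.normalize(z_rgb, dim=1)
    eye = torch.eye(z.shape[0], dtype=torch.bool, device=z.device)
    logits = (z @ z.t() / temperature).masked_fill(eye, float('-inf'))
    loss = 0.0
    for j in range(nn_idx.shape[1]):
        loss = loss + F.cross_entropy(logits, nn_idx[:, j])
    return loss / nn_idx.shape[1]
```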
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentations, and in doing so also achieves state-of-the-art performance on a number of video benchmarks (a sketch follows this entry).
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
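One simple way to "explicitly provide a sequence of augmentation parameterisations", sketched below for the entry above, is to embed each view's augmentation parameters and condition the projection head on them. The paper's actual mechanism is more elaborate; the MLP head and dimensions here are assumptions.

```python
import torch
import torch.nn as nn

class AugAwareHead(nn.Module):
    """Projection head that sees an encoding of the augmentation parameters
    applied to each view (e.g. normalized crop box plus temporal offset),
    so view differences can be attributed to known transformations."""
    def __init__(self, feat_dim=512, aug_dim=6, proj_dim=128):
        super().__init__()
        self.aug_mlp = nn.Sequential(nn.Linear(aug_dim, 64), nn.ReLU(),
                                     nn.Linear(64, 64))
        self.proj = nn.Sequential(nn.Linear(feat_dim + 64, 256), nn.ReLU(),
                                  nn.Linear(256, proj_dim))

    def forward(self, feats, aug_params):
        # feats: (B, feat_dim) backbone features for one view.
        # aug_params: (B, aug_dim), e.g. crop (x, y, w, h) + time (start, len).
        a = self.aug_mlp(aug_params)
        return self.proj(torch.cat([feats, a], dim=1))
```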
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
We extend the negative samples by introducing intra-negative samples, which break the temporal structure within a clip (a sketch follows this entry).
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
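Intra-negative samples, as referenced in the entry above, can be sketched as clips that keep the original frames but destroy their temporal structure, e.g. by shuffling or repeating frames, so the encoder must attend to motion rather than appearance alone. The two modes below follow that idea; the function name and mode set are illustrative.

```python
import torch

def make_intra_negative(clip, mode="shuffle"):
    """Build an intra-negative: same appearance, broken temporal structure.
    clip: (T, C, H, W) frames of one video clip."""
    T = clip.shape[0]
    if mode == "shuffle":
        return clip[torch.randperm(T)]               # random frame order
    if mode == "repeat":
        i = torch.randint(T, (1,)).item()
        return clip[i:i + 1].expand(T, -1, -1, -1)   # one frame, repeated
    raise ValueError(f"unknown mode: {mode}")
```

In an inter-intra framework, such clips are added to the negative set alongside clips from other videos.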