Learning from Untrimmed Videos: Self-Supervised Video Representation
Learning with Hierarchical Consistency
- URL: http://arxiv.org/abs/2204.03017v1
- Date: Wed, 6 Apr 2022 18:04:54 GMT
- Title: Learning from Untrimmed Videos: Self-Supervised Video Representation
Learning with Hierarchical Consistency
- Authors: Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yi Xu, Xiang Wang, Mingqian
Tang, Changxin Gao, Rong Jin, Nong Sang
- Abstract summary: We propose to learn representations by leveraging the more abundant information in untrimmed videos.
HiCo not only generates stronger representations on untrimmed videos, but also improves representation quality when applied to trimmed videos.
- Score: 60.756222188023635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural videos provide rich visual contents for self-supervised learning. Yet
most existing approaches for learning spatio-temporal representations rely on
manually trimmed videos, leading to limited diversity in visual patterns and
limited performance gain. In this work, we aim to learn representations by
leveraging more abundant information in untrimmed videos. To this end, we
propose to learn a hierarchy of consistencies in videos, i.e., visual
consistency and topical consistency, corresponding respectively to clip pairs
that tend to be visually similar when separated by a short time span and share
similar topics when separated by a long time span. Specifically, a hierarchical
consistency learning framework HiCo is presented, where the visually consistent
pairs are encouraged to have the same representation through contrastive
learning, while the topically consistent pairs are coupled through a topical
classifier that distinguishes whether they are topic related. Further, we
impose a gradual sampling algorithm for the proposed hierarchical consistency
learning and demonstrate its theoretical superiority. Empirically, we show
that HiCo not only generates stronger representations on untrimmed videos but
also improves the representation quality when applied to trimmed videos.
This is in contrast to standard contrastive learning that fails to learn
appropriate representations from untrimmed videos.
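To make the two objectives concrete, below is a minimal PyTorch-style sketch of the hierarchical consistency losses. It is an illustration of the idea rather than the authors' implementation: the feature shapes, the `topic_head` design, and the construction of negative topic pairs by shuffling the batch are assumptions made here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss over a batch: row i of z1 and row i of z2 are positives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def hico_losses(anchor, near, far, topic_head):
    """Hierarchical consistency on encoded clip features, all of shape (B, D).

    anchor/near: clips a short time span apart  -> visually consistent.
    anchor/far:  clips a long time span apart in the same untrimmed video
                 -> topically consistent.
    """
    # Visual consistency: short-span pairs are pulled to the same representation.
    loss_visual = info_nce(anchor, near)

    # Topical consistency: instead of forcing identical features, a classifier
    # only decides whether two clips are topic-related (from the same video).
    pos = topic_head(torch.cat([anchor, far], dim=1))             # same video
    neg = topic_head(torch.cat([anchor, far.roll(1, 0)], dim=1))  # shuffled: different videos
    logits = torch.cat([pos, neg]).squeeze(1)
    labels = torch.cat([torch.ones(len(pos)), torch.zeros(len(neg))])
    loss_topic = F.binary_cross_entropy_with_logits(logits, labels)

    return loss_visual + loss_topic

# Toy usage with random stand-ins for encoder outputs:
B, D = 8, 128
topic_head = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, 1))
loss = hico_losses(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D), topic_head)
loss.backward()
```

The gradual sampling algorithm would live in the data pipeline rather than in these losses; one plausible reading of the abstract is a curriculum that starts with nearby clip pairs and progressively widens the time span from which the topically consistent clips are drawn.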
Related papers
- Self-Supervised Video Representation Learning by Video Incoherence
Detection [28.540645395066434]
This paper introduces a novel self-supervised method that leverages incoherence detection for video representation learning.
It is rooted in the observation that the human visual system can easily identify video incoherence based on a comprehensive understanding of videos.
arXiv Detail & Related papers (2021-09-26T04:58:13Z)
- ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Learning Implicit Temporal Alignment for Few-shot Video Classification [40.57508426481838]
Few-shot video classification aims to learn new video categories with only a few labeled examples.
It is particularly challenging to learn a class-invariant spatial-temporal representation in such a setting.
We propose a novel matching-based few-shot learning strategy for video sequences in this work.
arXiv Detail & Related papers (2021-05-11T07:18:57Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Multiview Pseudo-Labeling for Semi-supervised Learning from Video [102.36355560553402]
We present a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.
Our method capitalizes on multiple views, but it nonetheless trains a model that is shared across appearance and motion input.
On multiple video recognition datasets, our method substantially outperforms its supervised counterpart, and compares favorably to previous work on standard benchmarks in self-supervised video representation learning.
arXiv Detail & Related papers (2021-04-01T17:59:48Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views of the same instance, with other data instances serving as negatives (a minimal sketch of this pairing scheme appears after this list).
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentation, and in doing so also achieve state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- We Have So Much In Common: Modeling Semantic Relational Set Abstractions
in Videos [29.483605238401577]
We propose an approach for learning semantic relational set abstractions on videos, inspired by human learning.
We combine visual features with natural language supervision to generate high-level representations of similarities across a set of videos.
arXiv Detail & Related papers (2020-08-12T22:57:44Z)
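Several of the entries above (ASCNet, CoCon, Composable Augmentation Encoding) build on the positive-pair paradigm referenced in the Composable Augmentation Encoding summary. As a minimal sketch of that paradigm, the snippet below treats two independently augmented views of the same clip as a positive pair and all other clips in the batch as negatives; the `augment` stub and the linear encoder are placeholders for a real augmentation pipeline and video backbone, not any paper's actual setup.

```python
import torch
import torch.nn.functional as F

def augment(clips):
    """Placeholder augmentation: light noise plus a random horizontal flip.
    Real pipelines use random crops, color jitter, temporal jittering, etc."""
    noisy = clips + 0.05 * torch.randn_like(clips)
    return torch.flip(noisy, dims=[-1]) if torch.rand(()) < 0.5 else noisy

def instance_discrimination_loss(encoder, clips, temperature=0.1):
    """Two augmented views of clip i form the positive pair; every other
    clip in the batch serves as a negative."""
    z1 = F.normalize(encoder(augment(clips)), dim=1)
    z2 = F.normalize(encoder(augment(clips)), dim=1)
    logits = z1 @ z2.t() / temperature        # diagonal entries are positives
    targets = torch.arange(clips.size(0))
    return F.cross_entropy(logits, targets)

# Toy usage: fake clips of shape (batch, channels, time, height, width).
B, C, T, H, W = 4, 3, 8, 16, 16
proj = torch.nn.Linear(C * T * H * W, 64)     # stand-in for a video encoder
encoder = lambda x: proj(x.flatten(1))
print(instance_discrimination_loss(encoder, torch.randn(B, C, T, H, W)))
```

The "augmentation aware" variant described in the Composable Augmentation Encoding summary would additionally feed the augmentation parameters (e.g., crop coordinates or temporal shifts) to the model, so that representations can stay sensitive to transformations that plain contrastive training would otherwise discard.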
This list is automatically generated from the titles and abstracts of the papers on this site.