Self-Supervised Video Representation Learning by Video Incoherence Detection
- URL: http://arxiv.org/abs/2109.12493v1
- Date: Sun, 26 Sep 2021 04:58:13 GMT
- Title: Self-Supervised Video Representation Learning by Video Incoherence Detection
- Authors: Haozhi Cao, Yuecong Xu, Jianfei Yang, Kezhi Mao, Lihua Xie, Jianxiong Yin, Simon See
- Abstract summary: This paper introduces a novel self-supervised method that leverages incoherence detection for video representation learning.
It stems from the observation that the human visual system can easily identify video incoherence based on its comprehensive understanding of videos.
- Score: 28.540645395066434
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces a novel self-supervised method that leverages
incoherence detection for video representation learning. It stems from the
observation that the human visual system can easily identify video incoherence
based on its comprehensive understanding of videos. Specifically, the training
sample, denoted as the incoherent clip, is constructed from multiple sub-clips
hierarchically sampled from the same raw video with varying lengths of
incoherence between them. The network is trained to learn high-level
representations by predicting the location and length of the incoherence given
the incoherent clip as input. Additionally, intra-video contrastive learning is
introduced to maximize the mutual information between incoherent clips from the
same raw video. We evaluate our proposed method through extensive experiments
on action recognition and video retrieval with various backbone networks.
Experiments show that our method achieves state-of-the-art performance across
different backbone networks and datasets compared with previous coherence-based
methods.
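The sample construction described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: it concatenates only two sub-clips (the paper samples multiple sub-clips hierarchically), treats the video as an array of frames, and all names (`make_incoherent_clip`, `sub_len`, `max_gap`) are assumptions.

```python
import numpy as np

def make_incoherent_clip(video, sub_len=8, max_gap=8, rng=None):
    """Build one training sample for incoherence detection.

    Concatenates two sub-clips from the same video with a random
    temporal gap (the "incoherence") between them, and returns the
    clip together with the location and length of that incoherence.
    """
    rng = rng or np.random.default_rng()
    t = len(video)
    gap = int(rng.integers(1, max_gap + 1))                 # incoherence length
    start = int(rng.integers(0, t - (2 * sub_len + gap) + 1))
    first = video[start : start + sub_len]
    second = video[start + sub_len + gap : start + 2 * sub_len + gap]
    clip = np.concatenate([first, second], axis=0)          # incoherent clip
    location = sub_len                                      # frame index of the jump
    return clip, location, gap
```

The returned `location` and `gap` would then serve as the classification targets for the location- and length-prediction heads.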
Related papers
- SELF-VS: Self-supervised Encoding Learning For Video Summarization [6.21295508577576]
We propose a novel self-supervised video representation learning method using knowledge distillation to pre-train a transformer encoder.
Our method matches its semantic video representation, which is constructed with respect to frame importance scores, to a representation derived from a CNN trained on video classification.
arXiv Detail & Related papers (2023-03-28T14:08:05Z)
- Probabilistic Representations for Video Contrastive Learning [64.47354178088784]
This paper presents a self-supervised representation learning method that bridges contrastive learning with probabilistic representation.
By sampling embeddings from the whole video distribution, we can circumvent the careful sampling strategy or transformations to generate augmented views of the clips.
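As a rough illustration of the probabilistic-embedding idea in this summary (drawing views from a per-video distribution instead of hand-crafting augmentations), one might sample embeddings with the reparameterization trick. The function name and the diagonal-Gaussian parameterization are assumptions, not the paper's exact formulation.

```python
import numpy as np

def sample_video_embeddings(mu, log_var, n_samples=4, rng=None):
    """Draw stochastic embeddings from a per-video Gaussian.

    Each video is represented by N(mu, diag(exp(log_var))); sampled
    embeddings play the role of augmented views for contrastive learning.
    """
    rng = rng or np.random.default_rng()
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal((n_samples,) + mu.shape)
    return mu + eps * std  # reparameterization trick
```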
arXiv Detail & Related papers (2022-04-08T09:09:30Z)
- Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency [60.756222188023635]
We propose to learn representations by leveraging more abundant information in unsupervised videos.
HiCo can generate stronger representations on untrimmed videos, and it also improves the representation quality when applied to trimmed videos.
arXiv Detail & Related papers (2022-04-06T18:04:54Z)
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- Spatio-Temporal Perturbations for Video Attribution [33.19422909074655]
Attribution methods provide a way to interpret opaque neural networks visually.
We investigate a generic perturbation-based attribution method that is compatible with diversified video understanding networks.
We introduce reliable objective metrics which are checked by a newly proposed reliability measurement.
arXiv Detail & Related papers (2021-09-01T07:44:16Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Multiview Pseudo-Labeling for Semi-supervised Learning from Video [102.36355560553402]
We present a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.
Our method capitalizes on multiple views, but it nonetheless trains a model that is shared across appearance and motion input.
On multiple video recognition datasets, our method substantially outperforms its supervised counterpart, and compares favorably to previous work on standard benchmarks in self-supervised video representation learning.
arXiv Detail & Related papers (2021-04-01T17:59:48Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Self-supervised Video Representation Learning by Pace Prediction [48.029602040786685]
This paper addresses the problem of self-supervised video representation learning from a new perspective -- by video pace prediction.
It stems from the observation that the human visual system is sensitive to video pace.
We randomly sample training clips in different paces and ask a neural network to identify the pace for each video clip.
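The pace-prediction pretext task described above can be sketched as a simple sampling routine: take every p-th frame so the clip plays p times faster, and train the network to classify p. The pace set and names here are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def sample_pace_clip(video, clip_len=16, paces=(1, 2, 4, 8), rng=None):
    """Sample a training clip at a random playback pace.

    Returns the subsampled clip and the index of the chosen pace,
    which serves as the classification target.
    """
    rng = rng or np.random.default_rng()
    label = int(rng.integers(len(paces)))
    p = paces[label]
    max_start = len(video) - clip_len * p
    start = int(rng.integers(0, max_start + 1))
    clip = video[start : start + clip_len * p : p]  # every p-th frame
    return clip, label
```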
arXiv Detail & Related papers (2020-08-13T12:40:24Z)
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
To learn more discriminative video representations, we extend the negative samples by introducing intra-negative samples.
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.