Self-supervised and Weakly Supervised Contrastive Learning for
Frame-wise Action Representations
- URL: http://arxiv.org/abs/2212.03125v1
- Date: Tue, 6 Dec 2022 16:42:22 GMT
- Title: Self-supervised and Weakly Supervised Contrastive Learning for
Frame-wise Action Representations
- Authors: Minghao Chen, Renbo Tu, Chenxi Huang, Yuqi Lin, Boxi Wu, Deng Cai
- Abstract summary: We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state-of-the-art by a large margin on downstream fine-grained action classification while also offering faster inference.
- Score: 26.09611987412578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous work on action representation learning focused on global
representations for short video clips. In contrast, many practical
applications, such as video alignment, strongly demand learning the intensive
representation of long videos. In this paper, we introduce a new framework of
contrastive action representation learning (CARL) to learn frame-wise action
representation in a self-supervised or weakly-supervised manner, especially for
long videos. Specifically, we introduce a simple but effective video encoder
that considers both spatial and temporal context by combining convolution and
transformer. Inspired by the recent massive progress in self-supervised
learning, we propose a new sequence contrast loss (SCL) applied to two related
views obtained through a series of spatio-temporal data augmentations. The
framework comes in two versions. One is the self-supervised version, which
optimizes the embedding space by minimizing the KL-divergence between the
sequence similarity of the two augmented views and a prior Gaussian
distribution over timestamp distance. The other is the weakly-supervised
version, which builds more sample pairs among videos carrying video-level
labels via dynamic time warping (DTW). Experiments on the FineGym, PennAction,
and Pouring datasets show that our method outperforms the previous
state-of-the-art by a large margin on downstream fine-grained action
classification while also offering faster inference. Surprisingly, although it
is not trained on paired videos as in previous works, our self-supervised
version also shows outstanding performance on video alignment and fine-grained
frame retrieval tasks.
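To make the objective concrete, here is a minimal PyTorch-style sketch of the
self-supervised SCL as described above. The tensor shapes, the temperature, the
Gaussian variance, and the per-row normalization of the prior are illustrative
assumptions, not the authors' released implementation.

```python
# Hedged sketch of the sequence contrast loss (SCL): KL-divergence between
# the sequence similarity of two augmented views and a Gaussian prior over
# timestamp distance. Shapes and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def sequence_contrast_loss(z1, z2, t1, t2, temperature=0.1, sigma=1.0):
    """z1, z2: (T, D) frame-wise embeddings of two augmented views.
    t1, t2: (T,) timestamps of the sampled frames in the source video."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)

    # Sequence similarity between the two views, one distribution per frame.
    logits = z1 @ z2.t() / temperature            # (T, T)
    log_p = F.log_softmax(logits, dim=-1)

    # Prior: Gaussian in timestamp distance, normalized over each row.
    dist_sq = (t1[:, None] - t2[None, :]).float() ** 2
    prior = torch.softmax(-dist_sq / (2 * sigma ** 2), dim=-1)

    # KL(prior || p); F.kl_div takes log-probabilities as its first argument.
    return F.kl_div(log_p, prior, reduction="batchmean")
```

In the weakly-supervised version described above, the within-video timestamp
prior would instead come from frame correspondences produced by DTW alignment
between different videos that share the same video-level label.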
Related papers
- Transform-Equivariant Consistency Learning for Temporal Sentence
Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the observation that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Frame-wise Action Representations for Long Videos via Sequence
Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied to two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z) - Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
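As a rough illustration of this pretext task, the sketch below pastes a single
pseudo-action region from one video onto two background videos and maximizes
the agreement of the pasted regions' features. The encoder interface, region
length, and cosine-similarity loss are assumptions for illustration, not the
paper's code, and the multi-clip regions of the paper are simplified to one
region for brevity.

```python
# Hedged sketch of the PAL pretext task: paste a pseudo action into two
# videos, then maximize agreement between the pasted regions' features.
import torch
import torch.nn.functional as F

def pal_pretext_loss(encoder, source, bg1, bg2, region_len=8):
    """source, bg1, bg2: (T, C, H, W) frame tensors of three videos.
    encoder: assumed to map (L, C, H, W) frames to (L, D) features."""
    T = source.shape[0]
    # 1) Randomly select a temporal region from the source video as a
    #    pseudo action.
    start = torch.randint(0, T - region_len + 1, (1,)).item()
    pseudo = source[start:start + region_len]

    # 2) Paste it onto a different temporal position of each background video.
    def paste(bg):
        pos = torch.randint(0, bg.shape[0] - region_len + 1, (1,)).item()
        out = bg.clone()
        out[pos:pos + region_len] = pseudo
        return out, pos

    v1, p1 = paste(bg1)
    v2, p2 = paste(bg2)

    # 3) Align the pasted regions' features across the two synthetic videos
    #    and maximize their agreement (cosine similarity here).
    f1 = encoder(v1[p1:p1 + region_len]).mean(dim=0)
    f2 = encoder(v2[p2:p2 + region_len]).mean(dim=0)
    return 1 - F.cosine_similarity(f1, f2, dim=0)
```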
arXiv Detail & Related papers (2022-03-25T12:13:43Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as the instances and learn visual representations by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - Learning Implicit Temporal Alignment for Few-shot Video Classification [40.57508426481838]
Few-shot video classification aims to learn new video categories with only a few labeled examples.
It is particularly challenging to learn a class-invariant spatial-temporal representation in such a setting.
We propose a novel matching-based few-shot learning strategy for video sequences in this work.
arXiv Detail & Related papers (2021-05-11T07:18:57Z) - Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)