Frame-wise Action Representations for Long Videos via Sequence
Contrastive Learning
- URL: http://arxiv.org/abs/2203.14957v1
- Date: Mon, 28 Mar 2022 17:59:54 GMT
- Title: Frame-wise Action Representations for Long Videos via Sequence
Contrastive Learning
- Authors: Minghao Chen, Fangyun Wei, Chong Li, Deng Cai
- Abstract summary: We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
- Score: 44.412145665354736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior works on action representation learning mainly focus on designing
various architectures to extract the global representations for short video
clips. In contrast, many practical applications such as video alignment have
strong demand for learning dense representations for long videos. In this
paper, we introduce a novel contrastive action representation learning (CARL)
framework to learn frame-wise action representations, especially for long
videos, in a self-supervised manner. Concretely, we introduce a simple yet
efficient video encoder that considers spatio-temporal context to extract
frame-wise representations. Inspired by the recent progress of self-supervised
learning, we present a novel sequence contrastive loss (SCL) applied on two
correlated views obtained through a series of spatio-temporal data
augmentations. SCL optimizes the embedding space by minimizing the
KL-divergence between the sequence similarity of two augmented views and a
prior Gaussian distribution of timestamp distance. Experiments on FineGym,
PennAction and Pouring datasets show that our method outperforms previous
state-of-the-art by a large margin for downstream fine-grained action
classification. Surprisingly, even without training on paired videos, our
approach also shows outstanding performance on video alignment and fine-grained
frame retrieval tasks. Code and models are available at
https://github.com/minghchen/CARL_code.
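The abstract above specifies SCL only at a high level: the per-frame similarity distribution between two augmented views is pulled toward a Gaussian prior over timestamp distance via a KL-divergence. Below is a minimal NumPy sketch of a loss of that form; the function name, the temperature tau, and the Gaussian width sigma are illustrative assumptions rather than the authors' implementation (the official code is in the repository linked above).

```python
# Hedged sketch of a sequence contrastive loss (SCL) as described in the
# abstract: KL-divergence between a Gaussian prior over timestamp distance
# and the softmax similarity of frame embeddings from two augmented views.
# tau and sigma are illustrative hyper-parameters, not the paper's values.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sequence_contrastive_loss(z1, t1, z2, t2, tau=0.1, sigma=1.0):
    """z1: (N, D) frame embeddings of view 1 at timestamps t1 (N,).
       z2: (M, D) frame embeddings of view 2 at timestamps t2 (M,).
       Returns the mean KL(prior || predicted) over the frames of view 1."""
    # L2-normalize embeddings and compute temperature-scaled cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    pred = softmax(z1 @ z2.T / tau, axis=1)          # (N, M) predicted distribution
    # Gaussian prior over the timestamp distance between the two views.
    dist = (t1[:, None] - t2[None, :]) ** 2          # (N, M)
    prior = softmax(-dist / (2 * sigma ** 2), axis=1)
    # KL-divergence between the prior and the predicted distribution.
    eps = 1e-8
    kl = np.sum(prior * (np.log(prior + eps) - np.log(pred + eps)), axis=1)
    return kl.mean()

# Toy usage: two augmented views of an 8-frame sequence with 16-dim embeddings.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
t1 = t2 = np.arange(8, dtype=float)
print(sequence_contrastive_loss(z1, t1, z2, t2))
```

Because the prior depends only on timestamp distance, a loss of this form needs no labels or paired videos, which matches the self-supervised setting described in the abstract.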
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state-of-the-art by a large margin on downstream fine-grained action classification, with even faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z)
- Cross-Architecture Self-supervised Video Representation Learning [42.267775859095664]
We present a new cross-architecture contrastive learning framework for self-supervised video representation learning.
We introduce a temporal self-supervised learning module able to explicitly predict the edit distance between two video sequences.
We evaluate our method on the tasks of video retrieval and action recognition on UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2022-05-26T12:41:19Z)
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Learning Implicit Temporal Alignment for Few-shot Video Classification [40.57508426481838]
Few-shot video classification aims to learn new video categories with only a few labeled examples.
It is particularly challenging to learn a class-invariant spatial-temporal representation in such a setting.
We propose a novel matching-based few-shot learning strategy for video sequences in this work.
arXiv Detail & Related papers (2021-05-11T07:18:57Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- Learning by Aligning Videos in Time [10.075645944474287]
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task.
We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network.
arXiv Detail & Related papers (2021-03-31T17:55:52Z)
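Several of the entries above, like the main paper, evaluate frame-wise representations on video alignment and fine-grained frame retrieval. The sketch below shows how such embeddings can be used for alignment by nearest-neighbour frame matching; the function name is illustrative and random features stand in for a trained encoder.

```python
# Hedged sketch: align two videos by nearest-neighbour matching of
# frame-wise embeddings under cosine similarity. Random features are used
# here purely for illustration in place of a trained frame-wise encoder.
import numpy as np

def align_by_nearest_neighbour(emb_a, emb_b):
    """For each frame of video A (emb_a: (N, D)), return the index of its
    most similar frame in video B (emb_b: (M, D))."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)   # (N,) frame indices into video B

# Toy usage with random features.
rng = np.random.default_rng(0)
emb_a, emb_b = rng.normal(size=(10, 16)), rng.normal(size=(12, 16))
print(align_by_nearest_neighbour(emb_a, emb_b))
```

In practice the embeddings come from the trained encoder, and the resulting correspondences are commonly scored with alignment metrics such as Kendall's tau.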