Collaboratively Self-supervised Video Representation Learning for Action
Recognition
- URL: http://arxiv.org/abs/2401.07584v1
- Date: Mon, 15 Jan 2024 10:42:04 GMT
- Title: Collaboratively Self-supervised Video Representation Learning for Action
Recognition
- Authors: Jie Zhang, Zhifan Wan, Lanqing Hu, Stephen Lin, Shuzhe Wu, Shiguang
Shan
- Abstract summary: We design a Collaboratively Self-supervised Video Representation learning framework specific to action recognition.
Our method achieves state-of-the-art performance on the UCF101 and HMDB51 datasets.
- Score: 58.195372471117615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Considering the close connection between action recognition and human pose
estimation, we design a Collaboratively Self-supervised Video Representation
(CSVR) learning framework specific to action recognition by jointly considering
generative pose prediction and discriminative context matching as pretext
tasks. Specifically, our CSVR consists of three branches: a generative pose
prediction branch, a discriminative context matching branch, and a video
generating branch. Among them, the first one encodes dynamic motion features by
utilizing a Conditional-GAN to predict the human poses of future frames, and the
second branch extracts static context features by pulling the representations
of clips and compressed key frames from the same video together while pushing
apart the pairs from different videos. The third branch is designed to recover
the current video frames and predict the future ones, for the purpose of
collaboratively improving dynamic motion features and static context features.
Extensive experiments demonstrate that our method achieves state-of-the-art
performance on the UCF101 and HMDB51 datasets.
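The context matching branch is only sketched in the abstract; as a rough illustration, its "pull together / push apart" objective can be read as an InfoNCE-style contrastive loss between clip embeddings and compressed key-frame embeddings of the same videos, with other videos in the batch acting as negatives. The snippet below is a minimal sketch under that reading, not the authors' implementation; the feature dimension, temperature, and the assumption that both encoders output (B, D) embeddings are illustrative choices.

```python
# Minimal sketch of an InfoNCE-style context-matching loss: clip features and
# compressed key-frame features from the same video form the positive pair;
# pairs from different videos in the batch act as negatives. All sizes and the
# temperature are placeholder assumptions, not details from the paper.
import torch
import torch.nn.functional as F

def context_matching_loss(clip_feats: torch.Tensor,
                          keyframe_feats: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """clip_feats, keyframe_feats: (B, D) embeddings of the same B videos."""
    clip_feats = F.normalize(clip_feats, dim=1)
    keyframe_feats = F.normalize(keyframe_feats, dim=1)
    # Cosine similarity between every clip and every key-frame embedding.
    logits = clip_feats @ keyframe_feats.t() / temperature   # (B, B)
    # Diagonal entries are the positive (same-video) pairs.
    targets = torch.arange(clip_feats.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    # Random features stand in for encoder outputs in this toy usage example.
    B, D = 8, 128
    loss = context_matching_loss(torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```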
Related papers
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z)
- Online Video Instance Segmentation via Robust Context Fusion [36.376900904288966]
Video instance segmentation (VIS) aims at classifying, segmenting and tracking object instances in video sequences.
Recent transformer-based neural networks have demonstrated their powerful modeling capability for the VIS task.
We propose a robust context fusion network to tackle VIS in an online fashion, which predicts instance segmentation frame-by-frame with a few preceding frames.
arXiv Detail & Related papers (2022-07-12T15:04:50Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively, producing a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Towards Tokenized Human Dynamics Representation [41.75534387530019]
We study how to segment and cluster videos into recurring temporal patterns in a self-supervised way.
We evaluate the frame-wise representation learning step by Kendall's Tau and the lexicon building step by normalized mutual information and language entropy (a toy illustration of these standard metrics follows this list).
On the AIST++ and PKU-MMD datasets, actons bring significant performance improvements compared to several baselines.
arXiv Detail & Related papers (2021-11-22T18:59:58Z)
- Visual Relationship Forecasting in Videos [56.122037294234865]
We present a new task named Visual Relationship Forecasting (VRF) in videos to explore the prediction of visual relationships in a reasoning manner.
Given a subject-object pair with H existing frames, VRF aims to predict their future interactions for the next T frames without visual evidence.
To evaluate the VRF task, we introduce two video datasets named VRF-AG and VRF-VidOR, with a series of temporally localized visual relation annotations in a video.
arXiv Detail & Related papers (2021-07-02T16:43:19Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class, instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
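Kendall's Tau and normalized mutual information, mentioned in the Towards Tokenized Human Dynamics Representation entry above, are standard off-the-shelf metrics; the toy snippet below only shows how they are conventionally computed with SciPy and scikit-learn on made-up data, and is not that paper's evaluation code.

```python
# Toy illustration of the two standard metrics named above; the data are made up.
from scipy.stats import kendalltau
from sklearn.metrics import normalized_mutual_info_score

# Kendall's Tau: rank correlation between a predicted and a ground-truth frame
# ordering (used for evaluating the frame-wise representation learning step).
pred_order = [0, 1, 3, 2, 4]
true_order = [0, 1, 2, 3, 4]
tau, p_value = kendalltau(pred_order, true_order)

# Normalized mutual information: agreement between cluster assignments ("actons")
# and ground-truth labels (used for evaluating the lexicon building step).
cluster_ids = [0, 0, 1, 1, 2, 2]
labels = [0, 0, 1, 1, 1, 2]
nmi = normalized_mutual_info_score(labels, cluster_ids)

print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f}), NMI = {nmi:.3f}")
```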