Collaboratively Self-supervised Video Representation Learning for Action Recognition
- URL: http://arxiv.org/abs/2401.07584v2
- Date: Fri, 31 Jan 2025 14:33:03 GMT
- Title: Collaboratively Self-supervised Video Representation Learning for Action Recognition
- Authors: Jie Zhang, Zhifan Wan, Lanqing Hu, Stephen Lin, Shuzhe Wu, Shiguang Shan
- Abstract summary: We design a Collaboratively Self-supervised Video Representation learning framework specific to action recognition.
Our method achieves state-of-the-art performance on multiple popular video datasets.
- Score: 54.92120002380786
- License:
- Abstract: Considering the close connection between action recognition and human pose estimation, we design a Collaboratively Self-supervised Video Representation (CSVR) learning framework specific to action recognition by jointly factoring in generative pose prediction and discriminative context matching as pretext tasks. Specifically, our CSVR consists of three branches: a generative pose prediction branch, a discriminative context matching branch, and a video generation branch. The first branch encodes dynamic motion features by utilizing a Conditional-GAN to predict the human poses of future frames, and the second branch extracts static context features by contrasting positive and negative pairs of video features and I-frame features. The third branch is designed to generate both current and future video frames, for the purpose of collaboratively improving the dynamic motion features and static context features. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple popular video datasets.
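The discriminative context matching branch described above contrasts positive and negative pairs of video features and I-frame features. The abstract does not specify the exact loss, but this kind of positive/negative contrast is commonly realized as an InfoNCE objective; the sketch below is a minimal NumPy illustration under that assumption, with batch size, dimensionality, and temperature chosen arbitrarily:

```python
import numpy as np

def info_nce(clip_feats, iframe_feats, temperature=0.1):
    """InfoNCE-style contrastive loss: each clip feature should match its
    own I-frame feature (positive) against all other I-frame features in
    the batch (negatives)."""
    # L2-normalize both feature sets so the dot product is cosine similarity.
    c = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    f = iframe_feats / np.linalg.norm(iframe_feats, axis=1, keepdims=True)
    logits = c @ f.T / temperature  # (B, B) similarity matrix
    # Row-wise log-softmax; the positives lie on the diagonal.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 32))
# Perfectly aligned pairs should incur a lower loss than mismatched pairs.
aligned = info_nce(feats, feats)
shuffled = info_nce(feats, feats[::-1])
print(aligned, shuffled)
```

Here `aligned` comes out lower than `shuffled` because the diagonal positives dominate the softmax only when the pairs actually correspond.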
Related papers
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z)
- A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection [7.202931445597171]
We present a novel network that detects actions in untrimmed videos.
The network encodes the locations of action semantics in video frames utilizing motion-aware 2D positional encoding.
The approach outperforms the state-of-the-art solutions on four datasets.
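The summary above mentions encoding action semantics with motion-aware 2D positional encoding. The motion-aware variant is not detailed here; as a baseline reference, a standard 2D sinusoidal positional encoding (one half of the channels for the row index, the other half for the column index) can be sketched as follows, with the grid size and dimension chosen arbitrarily:

```python
import numpy as np

def sincos_2d_positional_encoding(h, w, dim):
    """Standard 2D sinusoidal positional encoding: the first dim/2 channels
    encode the row index, the remaining dim/2 encode the column index."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    d = dim // 2
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))  # (d/2,)
    ys = np.arange(h)[:, None] * freqs[None, :]        # (h, d/2)
    xs = np.arange(w)[:, None] * freqs[None, :]        # (w, d/2)
    pe = np.zeros((h, w, dim))
    pe[:, :, 0:d:2] = np.sin(ys)[:, None, :]           # row, sine channels
    pe[:, :, 1:d:2] = np.cos(ys)[:, None, :]           # row, cosine channels
    pe[:, :, d::2] = np.sin(xs)[None, :, :]            # column, sine channels
    pe[:, :, d + 1::2] = np.cos(xs)[None, :, :]        # column, cosine channels
    return pe

pe = sincos_2d_positional_encoding(4, 6, 16)
print(pe.shape)  # (4, 6, 16)
```

Each spatial location gets a distinct, fixed code, which is what lets a transformer recover where in the frame a token came from.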
arXiv Detail & Related papers (2024-05-13T21:47:35Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Towards Tokenized Human Dynamics Representation [41.75534387530019]
We study how to segment and cluster videos into recurring temporal patterns in a self-supervised way.
We evaluate the frame-wise representation learning step by Kendall's Tau and the lexicon building step by normalized mutual information and language entropy.
On the AIST++ and PKU-MMD datasets, actons bring significant performance improvements compared to several baselines.
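The entry above evaluates frame-wise representations with Kendall's Tau, a rank correlation between two orderings. As a minimal self-contained illustration (the paper's exact evaluation pipeline is not given in this summary), the statistic can be computed by counting concordant and discordant pairs:

```python
def kendall_tau(a, b):
    """Kendall rank correlation between two equal-length sequences:
    (concordant - discordant) pairs divided by the total number of pairs."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1   # the pair is ordered the same way in both
            elif s < 0:
                discordant += 1   # the pair is ordered oppositely

    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0: identical orderings
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0: reversed orderings
```

A value near 1 means the learned frame ordering agrees with the ground-truth temporal order, which is why it suits frame-wise representation evaluation.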
arXiv Detail & Related papers (2021-11-22T18:59:58Z)
- Target Adaptive Context Aggregation for Video Scene Graph Generation [36.669700084337045]
This paper deals with the challenging task of video scene graph generation (VidSGG).
We present a new detect-to-track paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking.
arXiv Detail & Related papers (2021-08-18T12:46:28Z)
- Visual Relationship Forecasting in Videos [56.122037294234865]
We present a new task named Visual Relationship Forecasting (VRF) in videos to explore the prediction of visual relationships in a manner of reasoning.
Given a subject-object pair with H existing frames, VRF aims to predict their future interactions for the next T frames without visual evidence.
To evaluate the VRF task, we introduce two video datasets named VRF-AG and VRF-VidOR, with a series of temporally localized visual relation annotations in a video.
arXiv Detail & Related papers (2021-07-02T16:43:19Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)