Collaboratively Self-supervised Video Representation Learning for Action Recognition
- URL: http://arxiv.org/abs/2401.07584v2
- Date: Fri, 31 Jan 2025 14:33:03 GMT
- Title: Collaboratively Self-supervised Video Representation Learning for Action Recognition
- Authors: Jie Zhang, Zhifan Wan, Lanqing Hu, Stephen Lin, Shuzhe Wu, Shiguang Shan
- Abstract summary: We design a Collaboratively Self-supervised Video Representation learning framework specific to action recognition.
Our method achieves state-of-the-art performance on multiple popular video datasets.
- Score: 54.92120002380786
- License:
- Abstract: Considering the close connection between action recognition and human pose estimation, we design a Collaboratively Self-supervised Video Representation (CSVR) learning framework specific to action recognition by jointly factoring in generative pose prediction and discriminative context matching as pretext tasks. Specifically, our CSVR consists of three branches: a generative pose prediction branch, a discriminative context matching branch, and a video generation branch. The first branch encodes dynamic motion features by utilizing a Conditional-GAN to predict the human poses of future frames, and the second branch extracts static context features by contrasting positive and negative pairs of video features and I-frame features. The third branch is designed to generate both current and future video frames, for the purpose of collaboratively improving the dynamic motion features and static context features. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple popular video datasets.
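The discriminative context matching branch described above contrasts positive and negative pairs of video features and I-frame features. The abstract does not specify the exact loss, but this kind of positive/negative contrast is commonly realized as an InfoNCE objective; the sketch below is a minimal NumPy illustration under that assumption, with batch size, dimensionality, and temperature chosen arbitrarily:

```python
import numpy as np

def info_nce(clip_feats, iframe_feats, temperature=0.1):
    """InfoNCE-style contrastive loss: each clip feature should match its
    own I-frame feature (positive) against all other I-frame features in
    the batch (negatives)."""
    # L2-normalize both feature sets so the dot product is cosine similarity.
    c = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    f = iframe_feats / np.linalg.norm(iframe_feats, axis=1, keepdims=True)
    logits = c @ f.T / temperature  # (B, B) similarity matrix
    # Row-wise log-softmax; the positives lie on the diagonal.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 32))
# Perfectly aligned pairs should incur a lower loss than mismatched pairs.
aligned = info_nce(feats, feats)
shuffled = info_nce(feats, feats[::-1])
print(aligned, shuffled)
```

Here `aligned` comes out lower than `shuffled` because the diagonal positives dominate the softmax only when the pairs actually correspond.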
Related papers
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z)
- A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection [7.202931445597171]
We present a novel network that detects actions in untrimmed videos.
The network encodes the locations of action semantics in video frames utilizing motion-aware 2D positional encoding.
The approach outperforms the state-of-the-art solutions on four datasets.
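The summary above mentions encoding action semantics with motion-aware 2D positional encoding. The motion-aware variant is not detailed here; as a baseline reference, a standard 2D sinusoidal positional encoding (one half of the channels for the row index, the other half for the column index) can be sketched as follows, with the grid size and dimension chosen arbitrarily:

```python
import numpy as np

def sincos_2d_positional_encoding(h, w, dim):
    """Standard 2D sinusoidal positional encoding: the first dim/2 channels
    encode the row index, the remaining dim/2 encode the column index."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    d = dim // 2
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))  # (d/2,)
    ys = np.arange(h)[:, None] * freqs[None, :]        # (h, d/2)
    xs = np.arange(w)[:, None] * freqs[None, :]        # (w, d/2)
    pe = np.zeros((h, w, dim))
    pe[:, :, 0:d:2] = np.sin(ys)[:, None, :]           # row, sine channels
    pe[:, :, 1:d:2] = np.cos(ys)[:, None, :]           # row, cosine channels
    pe[:, :, d::2] = np.sin(xs)[None, :, :]            # column, sine channels
    pe[:, :, d + 1::2] = np.cos(xs)[None, :, :]        # column, cosine channels
    return pe

pe = sincos_2d_positional_encoding(4, 6, 16)
print(pe.shape)  # (4, 6, 16)
```

Each spatial location gets a distinct, fixed code, which is what lets a transformer recover where in the frame a token came from.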
arXiv Detail & Related papers (2024-05-13T21:47:35Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Towards Tokenized Human Dynamics Representation [41.75534387530019]
We study how to segment and cluster videos into recurring temporal patterns in a self-supervised way.
We evaluate the frame-wise representation learning step by Kendall's Tau and the lexicon building step by normalized mutual information and language entropy.
On the AIST++ and PKU-MMD datasets, actons bring significant performance improvements compared to several baselines.
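The entry above evaluates frame-wise representations with Kendall's Tau, a rank correlation between two orderings. As a minimal self-contained illustration (the paper's exact evaluation pipeline is not given in this summary), the statistic can be computed by counting concordant and discordant pairs:

```python
def kendall_tau(a, b):
    """Kendall rank correlation between two equal-length sequences:
    (concordant - discordant) pairs divided by the total number of pairs."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1   # the pair is ordered the same way in both
            elif s < 0:
                discordant += 1   # the pair is ordered oppositely

    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0: identical orderings
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0: reversed orderings
```

A value near 1 means the learned frame ordering agrees with the ground-truth temporal order, which is why it suits frame-wise representation evaluation.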
arXiv Detail & Related papers (2021-11-22T18:59:58Z)
- Target Adaptive Context Aggregation for Video Scene Graph Generation [36.669700084337045]
This paper deals with the challenging task of video scene graph generation (VidSGG).
We present a new detect-to-track paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking.
arXiv Detail & Related papers (2021-08-18T12:46:28Z)
- Visual Relationship Forecasting in Videos [56.122037294234865]
We present a new task named Visual Relationship Forecasting (VRF) in videos to explore the prediction of visual relationships in a manner of reasoning.
Given a subject-object pair with H existing frames, VRF aims to predict their future interactions for the next T frames without visual evidence.
To evaluate the VRF task, we introduce two video datasets named VRF-AG and VRF-VidOR, with a series of temporally localized visual relation annotations in a video.
arXiv Detail & Related papers (2021-07-02T16:43:19Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)