View-Invariant Skeleton-based Action Recognition via Global-Local
Contrastive Learning
- URL: http://arxiv.org/abs/2209.11634v1
- Date: Fri, 23 Sep 2022 15:00:57 GMT
- Title: View-Invariant Skeleton-based Action Recognition via Global-Local
Contrastive Learning
- Authors: Cunling Bian, Wei Feng, Fanbo Meng, Song Wang
- Abstract summary: We propose a new view-invariant representation learning approach, without any manual action labeling, for skeleton-based human action recognition.
We leverage the multi-view skeleton data simultaneously taken for the same person in the network training, by maximizing the mutual information between the representations extracted from different views.
We show that the proposed method is robust to the view difference of the input skeleton data and significantly boosts the performance of unsupervised skeleton-based human action recognition methods.
- Score: 15.271862140292837
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Skeleton-based human action recognition has been drawing more interest
recently due to its low sensitivity to appearance changes and the accessibility
of more skeleton data. However, even the 3D skeletons captured in practice are
still sensitive to the viewpoint and direction, given the occlusion of different
human-body joints and the errors in human joint localization. Such view
variance of skeleton data may significantly affect the performance of action
recognition. To address this issue, we propose in this paper a new
view-invariant representation learning approach, without any manual action
labeling, for skeleton-based human action recognition. Specifically, we
leverage the multi-view skeleton data simultaneously taken for the same person
in the network training, by maximizing the mutual information between the
representations extracted from different views, and then propose a global-local
contrastive loss to model the multi-scale co-occurrence relationships in both
spatial and temporal domains. Extensive experimental results show that the
proposed method is robust to the view difference of the input skeleton data and
significantly boosts the performance of unsupervised skeleton-based human
action recognition methods, resulting in new state-of-the-art accuracies on two challenging
multi-view benchmarks, PKU-MMD and NTU RGB+D.
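The training signal described in the abstract can be illustrated with a short sketch. The code below is not the authors' released implementation; it shows a generic InfoNCE-style contrastive loss that maximizes agreement between embeddings of the same action captured from two camera views, which is the mutual-information-maximization idea the abstract describes, with the paper's global-local, multi-scale terms omitted. The function name `info_nce`, the `temperature` value, and the toy inputs are illustrative assumptions.

```python
# Minimal sketch of cross-view contrastive learning for skeleton embeddings.
# NOT the paper's global-local loss: only the basic "pull together two views
# of the same action, push apart different actions" term is shown.
import torch
import torch.nn.functional as F


def info_nce(z_view_a: torch.Tensor, z_view_b: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE loss for view-paired embeddings of shape (batch, dim).

    Row i of z_view_a and row i of z_view_b are assumed to come from the
    same action instance observed from two different camera views.
    """
    z_a = F.normalize(z_view_a, dim=1)
    z_b = F.normalize(z_view_b, dim=1)
    logits = z_a @ z_b.t() / temperature          # (batch, batch) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric cross-entropy: each embedding must identify its cross-view partner.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy usage with random vectors standing in for the outputs of a skeleton
    # encoder applied to synchronized multi-view sequences.
    za, zb = torch.randn(8, 128), torch.randn(8, 128)
    print(info_nce(za, zb).item())
```

In practice both embeddings would come from the same skeleton encoder applied to sequences of one action recorded simultaneously by different cameras, as in multi-view setups such as PKU-MMD and NTU RGB+D.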
Related papers
- One-Shot Action Recognition via Multi-Scale Spatial-Temporal Skeleton
Matching [77.6989219290789]
One-shot skeleton action recognition aims to learn a skeleton action recognition model with a single training sample.
This paper presents a novel one-shot skeleton action recognition technique that handles skeleton action recognition via multi-scale spatial-temporal feature matching.
arXiv Detail & Related papers (2023-07-14T11:52:10Z)
- Adaptive Local-Component-aware Graph Convolutional Network for One-shot
Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition.
Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art performance.
arXiv Detail & Related papers (2022-09-21T02:33:07Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
Action Recognition [88.34182299496074]
Action labels are available only on the source dataset and unavailable on the target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised classification tasks (a minimal sketch of this idea follows the related-papers list).
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- Skeleton-Based Mutually Assisted Interacted Object Localization and
Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data.
Our method achieves the best or competitive performance compared with state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z)
- Unsupervised View-Invariant Human Posture Representation [28.840986167408037]
We present a novel unsupervised approach that learns to extract a view-invariant 3D human pose representation from a 2D image.
Our model is trained by exploiting the intrinsic view-invariant properties of human pose between simultaneous frames.
We show improvements on the state-of-the-art unsupervised cross-view action classification accuracy on RGB and depth images.
arXiv Detail & Related papers (2021-09-17T19:23:31Z)
- Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based
Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z)
- Skeleton-Contrastive 3D Action Representation Learning [35.06361753065124]
This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition.
Our approach achieves state-of-the-art performance for self-supervised learning from skeleton data on the challenging PKU and NTU datasets.
arXiv Detail & Related papers (2021-08-08T14:44:59Z)
- Group-Skeleton-Based Human Action Recognition in Complex Events [15.649778891665468]
We propose a novel group-skeleton-based human action recognition method in complex events.
This method first utilizes multi-scale spatial-temporal graph convolutional networks (MS-G3Ds) to extract skeleton features from multiple persons.
Results on the HiEve dataset show that our method can give superior performance compared to other state-of-the-art methods.
arXiv Detail & Related papers (2020-11-26T13:19:14Z)
- DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z)
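As a companion to the cross-dataset entry above ("Learning from Temporal Spatial Cubism ..."), the sketch below illustrates, under assumed details, the kind of permutation-prediction pretext task its summary mentions: a skeleton sequence is split into temporal segments, the segments are shuffled, and a classifier is trained to recover which permutation was applied; an analogous task can permute body-part groups instead of time segments. The function name, segment count, and tensor shapes are illustrative, not taken from the cited paper.

```python
# Minimal sketch of a temporal-permutation pretext task for skeleton data.
# Assumed details only; not the cited paper's implementation.
import itertools
import random
import torch


def permute_segments(sequence: torch.Tensor, num_segments: int = 3):
    """Shuffle equal-length temporal segments of a (frames, joints, coords) sequence.

    Returns the permuted sequence and the index of the applied permutation,
    which serves as the classification label for the pretext task.
    """
    perms = list(itertools.permutations(range(num_segments)))
    label = random.randrange(len(perms))
    segments = torch.chunk(sequence, num_segments, dim=0)
    shuffled = torch.cat([segments[i] for i in perms[label]], dim=0)
    return shuffled, label


if __name__ == "__main__":
    seq = torch.randn(90, 25, 3)          # 90 frames, 25 joints, 3D coordinates
    shuffled, label = permute_segments(seq)
    print(shuffled.shape, "permutation id:", label)  # classifier target = label
```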