Zero-shot Skeleton-based Action Recognition via Mutual Information
Estimation and Maximization
- URL: http://arxiv.org/abs/2308.03950v1
- Date: Mon, 7 Aug 2023 23:41:55 GMT
- Title: Zero-shot Skeleton-based Action Recognition via Mutual Information
Estimation and Maximization
- Authors: Yujie Zhou, Wenwen Qiang, Anyi Rao, Ning Lin, Bing Su, Jiaqi Wang
- Abstract summary: Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories.
We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization.
- Score: 26.721082316870532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot skeleton-based action recognition aims to recognize actions of
unseen categories after training on data of seen categories. The key is to
build the connection between visual and semantic space from seen to unseen
classes. Previous studies have primarily focused on encoding sequences into a
single feature vector and subsequently mapping the features to an identical
anchor point within the embedding space. Their performance is hindered by 1)
ignoring the global visual/semantic distribution alignment, which limits their
ability to capture the true interdependence between the two spaces, and 2)
neglecting temporal information, since the frame-wise features with rich action
clues are directly pooled into a single feature vector. We propose a new
zero-shot skeleton-based action recognition method via mutual information (MI)
estimation and maximization. Specifically, 1) we maximize the MI between visual
and semantic space for distribution alignment; 2) we leverage the temporal
information for estimating the MI by encouraging MI to increase as more frames
are observed. Extensive experiments on three large-scale skeleton action
datasets confirm the effectiveness of our method. Code:
https://github.com/YujieOuO/SMIE.
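For intuition, the following is a minimal PyTorch-style sketch of the two ingredients described in the abstract: a Jensen-Shannon-type MI lower bound that aligns the visual and semantic distributions, and a ranking-style term that encourages the MI estimate not to decrease as more frames are observed. The module shapes, the margin term, and all names here are assumptions for illustration only; the authors' actual implementation is at https://github.com/YujieOuO/SMIE.

```python
# Hypothetical sketch of MI-based visual/semantic alignment with a temporal
# ranking term, loosely following the SMIE abstract. Not the official code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MIEstimator(nn.Module):
    """Scores (visual, semantic) pairs; higher score = more dependent."""

    def __init__(self, vis_dim=256, sem_dim=300, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vis_dim + sem_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, v, s):
        return self.net(torch.cat([v, s], dim=-1)).squeeze(-1)


def js_mi_lower_bound(scores_joint, scores_marginal):
    # Jensen-Shannon MI lower bound: E_joint[-softplus(-T)] - E_marg[softplus(T)]
    return (-F.softplus(-scores_joint)).mean() - F.softplus(scores_marginal).mean()


def smie_style_loss(estimator, vis_partial, vis_full, sem, margin=0.1):
    """vis_partial / vis_full: pooled features from fewer / all frames, (B, vis_dim).
    sem: class-text embeddings aligned with the batch, (B, sem_dim)."""
    sem_shuffled = sem[torch.randperm(sem.size(0))]           # marginal samples

    # 1) Maximize MI between visual and semantic spaces (distribution alignment).
    mi_full = js_mi_lower_bound(estimator(vis_full, sem),
                                estimator(vis_full, sem_shuffled))

    # 2) Temporal term: MI should not decrease as more frames are observed.
    mi_partial = js_mi_lower_bound(estimator(vis_partial, sem),
                                   estimator(vis_partial, sem_shuffled))
    temporal_penalty = F.relu(mi_partial - mi_full + margin)

    return -mi_full + temporal_penalty
```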
Related papers
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z) - Fine-Grained Side Information Guided Dual-Prompts for Zero-Shot Skeleton Action Recognition [18.012159340628557]
We propose a novel method via Side information and dual-prompts learning for skeleton-based zero-shot action recognition (STAR) at the fine-grained level.
Our method achieves state-of-the-art performance in both ZSL and GZSL settings on the evaluated datasets.
arXiv Detail & Related papers (2024-04-11T05:51:06Z) - Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos [63.94040814459116]
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence.
We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps.
We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
arXiv Detail & Related papers (2023-08-19T09:12:13Z) - Self-supervised Action Representation Learning from Partial
Spatio-Temporal Skeleton Sequences [29.376328807860993]
We propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the local relationship between different skeleton joints and video frames.
Our method achieves state-of-the-art performance on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD under various downstream tasks.
arXiv Detail & Related papers (2023-02-17T17:35:05Z) - Adaptive Local-Component-aware Graph Convolutional Network for One-shot
Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition.
Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art performance.
arXiv Detail & Related papers (2022-09-21T02:33:07Z) - Joint-bone Fusion Graph Convolutional Network for Semi-supervised
Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z) - Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and
Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand of specific action understanding in real-world applications.
We propose a few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z) - CLTA: Contents and Length-based Temporal Attention for Few-shot Action
Recognition [2.0349696181833337]
We propose a Contents and Length-based Temporal Attention model, which learns customized temporal attention for the individual video.
We show that even a backbone without fine-tuning, paired with an ordinary softmax classifier, can still achieve results similar to or better than state-of-the-art few-shot action recognition methods.
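As a rough illustration of the idea (not the CLTA authors' design), a contents- and length-based temporal attention module might look like the sketch below; the per-frame scoring head, the sigmoid length gate, and all dimensions are assumptions.

```python
# Assumed sketch of contents- and length-based temporal attention used for
# weighted pooling of frame features. Not the CLTA authors' implementation.
import torch
import torch.nn as nn


class ContentLengthAttention(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.content_score = nn.Linear(feat_dim, 1)   # per-frame content score
        self.length_gate = nn.Linear(1, 1)            # scalar gate from clip length

    def forward(self, frames):                        # frames: (B, T, feat_dim)
        B, T, _ = frames.shape
        scores = self.content_score(frames).squeeze(-1)             # (B, T)
        length = torch.full((B, 1), float(T), device=frames.device)
        scores = scores * torch.sigmoid(self.length_gate(length))   # length-aware scaling
        weights = torch.softmax(scores, dim=1)                      # attention over time
        return (weights.unsqueeze(-1) * frames).sum(dim=1)          # (B, feat_dim)
```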
arXiv Detail & Related papers (2021-03-18T23:40:28Z) - Memory Group Sampling Based Online Action Recognition Using Kinetic
Skeleton Features [4.674689979981502]
We propose three core ideas to handle the online action recognition problem.
First, we combine the spatial and temporal skeleton features to depict the actions.
Second, we propose a memory group sampling method to combine the previous action frames and current action frames.
Third, an improved 1D CNN network is employed for training and testing using the features from sampled frames.
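A simplified, assumed sketch of the second and third steps (memory group sampling followed by a 1D CNN classifier) is given below; the group count, memory handling, and network widths are placeholders rather than the paper's actual settings.

```python
# Rough, assumed sketch of memory-group sampling + 1D CNN classification.
# Group size, memory length, and network widths are placeholders.
import torch
import torch.nn as nn


def memory_group_sample(memory, current, num_groups=8):
    """memory / current: lists of per-frame feature vectors (spatial and temporal
    skeleton features concatenated). Returns (num_groups, feat_dim)."""
    frames = torch.stack(memory + current)                 # (N, feat_dim)
    groups = torch.chunk(frames, num_groups, dim=0)        # split timeline into groups
    # Take one representative frame per group (here: the last frame of each group).
    return torch.stack([g[-1] for g in groups])            # (num_groups, feat_dim)


class Online1DCNN(nn.Module):
    def __init__(self, feat_dim=150, num_classes=60):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(128, num_classes)

    def forward(self, sampled):                            # (num_groups, feat_dim)
        x = sampled.t().unsqueeze(0)                       # (1, feat_dim, num_groups)
        return self.fc(self.conv(x).squeeze(-1))           # (1, num_classes)
```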
arXiv Detail & Related papers (2020-11-01T16:43:08Z) - A Self-Supervised Gait Encoding Approach with Locality-Awareness for 3D
Skeleton Based Person Re-Identification [65.18004601366066]
Person re-identification (Re-ID) via gait features within 3D skeleton sequences is a newly-emerging topic with several advantages.
This paper proposes a self-supervised gait encoding approach that can leverage unlabeled skeleton data to learn gait representations for person Re-ID.
arXiv Detail & Related papers (2020-09-05T16:06:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.