Multi-Semantic Fusion Model for Generalized Zero-Shot Skeleton-Based
Action Recognition
- URL: http://arxiv.org/abs/2309.09592v1
- Date: Mon, 18 Sep 2023 09:00:25 GMT
- Title: Multi-Semantic Fusion Model for Generalized Zero-Shot Skeleton-Based
Action Recognition
- Authors: Ming-Zhe Li, Zhen Jia, Zhang Zhang, Zhanyu Ma, and Liang Wang
- Abstract summary: Generalized zero-shot skeleton-based action recognition (GZSSAR) is a new and challenging problem in the computer vision community.
We propose a multi-semantic fusion (MSF) model for improving the performance of GZSSAR.
- Score: 32.291333054680855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalized zero-shot skeleton-based action recognition (GZSSAR) is a new
and challenging problem in the computer vision community, which requires models to
recognize actions from unseen classes without any training samples. Previous studies only utilize
the action labels of verb phrases as the semantic prototypes for learning the
mapping from skeleton-based actions to a shared semantic space. However, the
limited semantic information of action labels restricts the generalization
ability of skeleton features for recognizing unseen actions. In order to solve
this dilemma, we propose a multi-semantic fusion (MSF) model for improving the
performance of GZSSAR, where two kinds of class-level textual descriptions
(i.e., action descriptions and motion descriptions) are collected as auxiliary
semantic information to enhance the learning efficacy of generalizable skeleton
features. Specifically, a pre-trained language encoder takes the action
descriptions, motion descriptions and original class labels as inputs to obtain
rich semantic features for each action class, while a skeleton encoder is
implemented to extract skeleton features. Then, a variational autoencoder (VAE)
based generative module is employed to learn a cross-modal alignment between
skeleton and semantic features. Finally, a classification module is built to
recognize the action categories of input samples, where a seen-unseen
classification gate is adopted to predict whether an input sample comes from the
seen action classes or not. The superior performance in comparison with
previous models validates the effectiveness of the proposed MSF model on
GZSSAR.
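To make the described pipeline concrete, below is a minimal PyTorch-style sketch of the data flow named in the abstract: a frozen text encoder fuses the class label, action description, and motion description into a semantic prototype; a skeleton encoder extracts skeleton features; two VAE heads align the two modalities; and a seen-unseen gate routes samples to the appropriate classifier. The encoder backbones, the mean fusion of the three texts, the MSE-style alignment term, and the confidence-threshold gate are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class VAEHead(nn.Module):
    """Encodes a feature into a latent Gaussian and reconstructs it (one VAE branch)."""

    def __init__(self, in_dim: int, latent_dim: int = 64):
        super().__init__()
        self.mu = nn.Linear(in_dim, latent_dim)
        self.logvar = nn.Linear(in_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, in_dim)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar


class MSFSketch(nn.Module):
    """Sketch of the multi-semantic fusion pipeline (interfaces only, not the paper's code)."""

    def __init__(self, skeleton_encoder: nn.Module, text_encoder,
                 feat_dim: int = 256, latent_dim: int = 64):
        super().__init__()
        self.skeleton_encoder = skeleton_encoder  # e.g., an ST-GCN-style backbone (assumption)
        self.text_encoder = text_encoder          # pre-trained language encoder, kept frozen
        self.skel_vae = VAEHead(feat_dim, latent_dim)
        self.sem_vae = VAEHead(feat_dim, latent_dim)

    def semantic_prototype(self, label: str, action_desc: str, motion_desc: str):
        # Fuse the three class-level texts into one semantic feature.
        # Mean fusion is an assumption; the paper's fusion strategy may differ.
        feats = [self.text_encoder(t) for t in (label, action_desc, motion_desc)]
        return torch.stack(feats, dim=0).mean(dim=0)

    def forward(self, skeleton_seq, label, action_desc, motion_desc):
        s = self.skeleton_encoder(skeleton_seq)                       # skeleton feature
        t = self.semantic_prototype(label, action_desc, motion_desc)  # semantic feature
        s_rec, s_mu, s_logvar = self.skel_vae(s)
        t_rec, t_mu, t_logvar = self.sem_vae(t)
        # Cross-modal alignment: pull the two latent means together (simplified term).
        align_loss = torch.mean((s_mu - t_mu) ** 2)
        return (s_rec, t_rec), align_loss


def seen_unseen_gate(seen_class_probs: torch.Tensor, threshold: float = 0.5) -> str:
    """Route a test sample to the seen-class or unseen-class classifier.

    A plain confidence threshold stands in for the gating mechanism; the value
    0.5 is an arbitrary placeholder, not taken from the paper.
    """
    return "seen" if seen_class_probs.max().item() >= threshold else "unseen"
```

In a full GZSSAR setup the VAE-based generative module would also be used to synthesize features for unseen classes from their semantic prototypes; the sketch above only fixes the interfaces and data flow described in the abstract.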
Related papers
- Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning [13.68867780184022]
Few-shot learning aims to recognize new concepts using a limited number of visual samples.
Our framework incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs).
For the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an average improvement of 1.95% over the second-best competitor.
arXiv Detail & Related papers (2024-08-22T15:10:20Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised Semantic Segmentation [79.05949524349005]
We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from saliency maps.
We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps.
arXiv Detail & Related papers (2024-03-02T10:03:21Z)
- SEER-ZSL: Semantic Encoder-Enhanced Representations for Generalized Zero-Shot Learning [0.7420433640907689]
Generalized Zero-Shot Learning (GZSL) recognizes unseen classes by transferring knowledge from the seen classes.
This paper introduces a dual strategy to address the generalization gap.
arXiv Detail & Related papers (2023-12-20T15:18:51Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- Part-aware Prototypical Graph Network for One-shot Skeleton-based Action Recognition [57.86960990337986]
One-shot skeleton-based action recognition poses unique challenges in learning transferable representation from base classes to novel classes.
We propose a part-aware prototypical representation for one-shot skeleton-based action recognition.
We demonstrate the effectiveness of our method on two public skeleton-based action recognition datasets.
arXiv Detail & Related papers (2022-08-19T04:54:56Z)
- Generative Action Description Prompts for Skeleton-based Action Recognition [15.38417530693649]
We propose a Generative Action-description Prompts (GAP) approach for skeleton-based action recognition.
We employ a pre-trained large-scale language model as the knowledge engine to automatically generate text descriptions for body-part movements of actions.
Our proposed GAP method achieves noticeable improvements over various baseline models without extra cost at inference.
arXiv Detail & Related papers (2022-08-10T12:55:56Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experimental results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
- SimMC: Simple Masked Contrastive Learning of Skeleton Representations for Unsupervised Person Re-Identification [63.903237777588316]
We present a generic Simple Masked Contrastive learning (SimMC) framework to learn effective representations from unlabeled 3D skeletons for person re-ID.
Specifically, to fully exploit skeleton features within each skeleton sequence, we first devise a masked prototype contrastive learning (MPC) scheme.
Then, we propose the masked intra-sequence contrastive learning (MIC) to capture intra-sequence pattern consistency between subsequences.
arXiv Detail & Related papers (2022-04-21T00:19:38Z)
- GAN for Vision, KG for Relation: a Two-stage Deep Network for Zero-shot Action Recognition [33.23662792742078]
We propose a two-stage deep neural network for zero-shot action recognition.
In the sampling stage, we utilize a generative adversarial network (GAN) trained on action features and word vectors of seen classes.
In the classification stage, we construct a knowledge graph based on the relationship between word vectors of action classes and related objects.
arXiv Detail & Related papers (2021-05-25T09:34:42Z)