Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-Based
Action Recognition
- URL: http://arxiv.org/abs/2303.10904v2
- Date: Tue, 21 Mar 2023 08:32:22 GMT
- Title: Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-Based
Action Recognition
- Authors: Lilang Lin, Jiahang Zhang, Jiaying Liu
- Abstract summary: We propose an Actionlet-Dependent Contrastive Learning method (ActCLR).
The actionlet, defined as the discriminative subset of the human skeleton, effectively decomposes motion regions for better action modeling.
Different data transformations are applied to actionlet and non-actionlet regions to introduce more diversity while maintaining their own characteristics.
- Score: 33.68311764817763
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The self-supervised pretraining paradigm has achieved great success in
skeleton-based action recognition. However, these methods treat the motion and
static parts equally, and lack an adaptive design for different parts, which
has a negative impact on the accuracy of action recognition. To realize the
adaptive action modeling of both parts, we propose an Actionlet-Dependent
Contrastive Learning method (ActCLR). The actionlet, defined as the
discriminative subset of the human skeleton, effectively decomposes motion
regions for better action modeling. In detail, by contrasting with the static
anchor without motion, we extract the motion region of the skeleton data, which
serves as the actionlet, in an unsupervised manner. Then, centering on the actionlet, a motion-adaptive data transformation method is built. Different
data transformations are applied to actionlet and non-actionlet regions to
introduce more diversity while maintaining their own characteristics.
Meanwhile, we propose a semantic-aware feature pooling method to build feature representations for motion and static regions in a distinct manner.
Extensive experiments on NTU RGB+D and PKUMMD show that the proposed method
achieves remarkable action recognition performance. More visualization and
quantitative experiments demonstrate the effectiveness of our method. Our
project website is available at https://langlandslin.github.io/projects/ActCLR/
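To make the described pipeline concrete, below is a minimal, hypothetical sketch (not the authors' released implementation) of the two core steps from the abstract: extracting an actionlet by contrasting each joint's trajectory against a static anchor, and applying different augmentations to actionlet and non-actionlet joints. The array shapes, the mean-pose anchor, the top-k joint selection, and the noise-based augmentations are all illustrative assumptions.

```python
# Illustrative sketch of the actionlet idea (NOT the authors' code).
# Assumptions: a skeleton clip is an array of shape (T, J, 3) = (frames, joints, xyz);
# the "static anchor" is approximated by the temporal mean pose; the joints that
# deviate most from that anchor form the actionlet.
import numpy as np

def extract_actionlet(clip: np.ndarray, ratio: float = 0.3) -> np.ndarray:
    """Return a boolean mask over joints marking the motion (actionlet) region."""
    anchor = clip.mean(axis=0, keepdims=True)        # static anchor: mean pose, shape (1, J, 3)
    motion = np.linalg.norm(clip - anchor, axis=-1)  # per-frame deviation from the anchor, (T, J)
    saliency = motion.mean(axis=0)                   # average deviation per joint, (J,)
    k = max(1, int(ratio * clip.shape[1]))           # keep the top-k most dynamic joints
    actionlet = np.zeros(clip.shape[1], dtype=bool)
    actionlet[np.argsort(saliency)[-k:]] = True
    return actionlet

def motion_adaptive_augment(clip: np.ndarray, actionlet: np.ndarray,
                            rng: np.random.Generator) -> np.ndarray:
    """Perturb actionlet joints mildly and non-actionlet joints more strongly."""
    out = clip.copy()
    out[:, actionlet] += rng.normal(scale=0.01, size=out[:, actionlet].shape)    # preserve motion cues
    out[:, ~actionlet] += rng.normal(scale=0.05, size=out[:, ~actionlet].shape)  # heavier noise on static joints
    return out

# Toy usage: a random 50-frame, 25-joint clip (NTU-style joint count).
rng = np.random.default_rng(0)
clip = rng.normal(size=(50, 25, 3))
mask = extract_actionlet(clip)
aug = motion_adaptive_augment(clip, mask, rng)
print(mask.sum(), aug.shape)
```

In this toy version the two augmented views produced this way would then feed a standard contrastive objective; the point is only that the actionlet and non-actionlet regions receive different transformations.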
Related papers
- Object-Centric Latent Action Learning [70.3173534658611]
We propose a novel object-centric latent action learning approach, based on VideoSaur and LAPO.
This method effectively disentangles causal agent-object interactions from irrelevant background noise and reduces the performance degradation caused by distractors.
Our preliminary experiments with the Distracting Control Suite show that latent action pretraining based on object decompositions improves the quality of inferred latent actions by 2.7x and the efficiency of downstream fine-tuning with a small set of labeled actions, increasing return by 2.6x on average.
arXiv Detail & Related papers (2025-02-13T11:27:05Z) - An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z) - The impact of Compositionality in Zero-shot Multi-label action recognition for Object-based tasks [4.971065912401385]
We propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition.
Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification.
We validate our method on the Charades dataset, which contains a majority of object-based actions.
arXiv Detail & Related papers (2024-05-14T15:28:48Z) - Seeing in Flowing: Adapting CLIP for Action Recognition with Motion
Prompts Learning [14.292812802621707]
Contrastive Language-Image Pre-training (CLIP) has recently shown remarkable generalization on "zero-shot" training.
We explore the adaptation of CLIP to achieve a more efficient and generalized action recognition method.
Our method outperforms most existing state-of-the-art methods by a significant margin on "few-shot" and "zero-shot" training.
arXiv Detail & Related papers (2023-08-09T09:33:45Z) - Action Sensitivity Learning for Temporal Action Localization [35.65086250175736]
We propose an Action Sensitivity Learning framework (ASL) to tackle the task of temporal action localization.
We first introduce a lightweight Action Sensitivity Evaluator to learn the action sensitivity at the class level and instance level, respectively.
Based on the action sensitivity of each frame, we design an Action Sensitive Contrastive Loss to enhance features, where the action-aware frames are sampled as positive pairs to push away the action-irrelevant frames.
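For intuition, here is a hedged sketch (not the ASL implementation) of a frame-level contrastive objective of the kind described above: frames scored as action-aware serve as mutual positives, while low-sensitivity frames act as negatives. The tensor names, the 0.5 threshold, and the temperature are illustrative assumptions.

```python
# Illustrative frame-level contrastive loss in the spirit of the summary above.
# Assumptions: feats is (T, D) per-frame features, sensitivity is (T,) scores in [0, 1].
import torch
import torch.nn.functional as F

def action_sensitive_contrastive_loss(feats, sensitivity, tau=0.1, thresh=0.5):
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / tau          # (T, T) similarity logits
    pos = sensitivity >= thresh            # mask of action-aware frames
    anchors = pos.nonzero(as_tuple=True)[0]
    if anchors.numel() < 2:                # need at least one positive pair
        return sim.new_zeros(())
    losses = []
    for i in anchors.tolist():
        keep = torch.ones_like(pos)
        keep[i] = False                    # drop self-similarity
        logits = sim[i][keep]
        labels = pos[keep].float()         # other action-aware frames are positives
        log_prob = logits - torch.logsumexp(logits, dim=0)
        losses.append(-(labels * log_prob).sum() / labels.sum())
    return torch.stack(losses).mean()

# Toy usage: 8 frames with 16-dim features, half of them action-aware.
feats = torch.randn(8, 16)
sensitivity = torch.tensor([0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.6, 0.05])
print(action_sensitive_contrastive_loss(feats, sensitivity))
```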
arXiv Detail & Related papers (2023-05-25T04:19:14Z) - Weakly-Supervised Temporal Action Localization with Bidirectional
Semantic Consistency Constraint [83.36913240873236]
Weakly-Supervised Temporal Action Localization (WTAL) aims to classify actions and localize their temporal boundaries in a video.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate the positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z) - Skeleton-Based Mutually Assisted Interacted Object Localization and
Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data.
Our method achieves the best or competitive performance with the state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate the spatial-temporal kernels of dynamic-scale to adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - Intra- and Inter-Action Understanding via Temporal Action Parsing [118.32912239230272]
We construct a new dataset developed on sport videos with manual annotations of sub-actions, and conduct a study on temporal action parsing on top.
Our study shows that a sport activity usually consists of multiple sub-actions and that the awareness of such temporal structures is beneficial to action recognition.
We also investigate a number of temporal parsing methods, and thereon devise an improved method that is capable of mining sub-actions from training data without knowing their labels.
arXiv Detail & Related papers (2020-05-20T17:45:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.