PSUMNet: Unified Modality Part Streams are All You Need for Efficient
Pose-based Action Recognition
- URL: http://arxiv.org/abs/2208.05775v1
- Date: Thu, 11 Aug 2022 12:12:07 GMT
- Title: PSUMNet: Unified Modality Part Streams are All You Need for Efficient
Pose-based Action Recognition
- Authors: Neel Trivedi, Ravi Kiran Sarvadevabhatla
- Abstract summary: We introduce PSUMNet, a novel approach for scalable and efficient pose-based action recognition.
At the representation level, we propose a global frame-based part-stream approach, as opposed to conventional modality-based streams.
PSUMNet is highly efficient and outperforms competing methods that use 100%-400% more parameters.
- Score: 10.340665633567081
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pose-based action recognition is predominantly tackled by approaches that
treat the input skeleton in a monolithic fashion, i.e. the joints in the pose tree
are processed as a whole. However, such approaches ignore the fact that action
categories are often characterized by localized action dynamics involving only
small subsets of joints grouped into parts, such as the hands (e.g. 'Thumbs up') or
the legs (e.g. 'Kicking'). Although part-grouping based approaches exist, each part
group is not considered within the global pose frame, causing such methods to
fall short. Further, conventional approaches employ independent modality
streams (e.g. joint, bone, joint velocity, bone velocity) and train their
network multiple times on these streams, which massively increases the number
of training parameters. To address these issues, we introduce PSUMNet, a novel
approach for scalable and efficient pose-based action recognition. At the
representation level, we propose a global frame-based part-stream approach as
opposed to conventional modality-based streams. Within each part stream, the
associated data from multiple modalities is unified and consumed by the
processing pipeline. Experimentally, PSUMNet achieves state-of-the-art
performance on the widely used NTURGB+D 60/120 datasets and the dense joint
skeleton datasets NTU 60-X/120-X. PSUMNet is highly efficient and outperforms
competing methods which use 100%-400% more parameters. PSUMNet also generalizes
to the SHREC hand gesture dataset with competitive performance. Overall, PSUMNet's
scalability, performance, and efficiency make it an attractive choice for
action recognition and for deployment on compute-restricted embedded and edge
devices. Code and pretrained models can be accessed at
https://github.com/skelemoa/psumnet.
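To make the representation-level idea concrete, here is a minimal sketch of what the abstract describes: the four conventional modalities (joint, bone, joint velocity, bone velocity) are derived from the raw joints and unified along the channel axis, and each part stream then slices out its joint group while keeping coordinates in the global frame. The bone pairs and part groups below are hypothetical toy values, not PSUMNet's actual configuration; see the linked repository for the real pipeline.

```python
import numpy as np

# Hypothetical (child, parent) bone pairs and part groups for a toy 5-joint
# skeleton; PSUMNet's actual skeleton topology lives in the linked repo.
BONE_PAIRS = [(1, 0), (2, 1), (3, 0), (4, 3)]
PART_GROUPS = {"hands": [1, 2], "legs": [3, 4]}

def unified_modalities(joints: np.ndarray) -> np.ndarray:
    """Stack joint, bone, joint-velocity, and bone-velocity modalities along
    the channel axis, instead of training one network per modality."""
    bones = np.zeros_like(joints)                       # joints: (C, T, V)
    for child, parent in BONE_PAIRS:
        bones[:, :, child] = joints[:, :, child] - joints[:, :, parent]
    joint_vel = np.zeros_like(joints)
    joint_vel[:, 1:] = joints[:, 1:] - joints[:, :-1]   # frame-to-frame motion
    bone_vel = np.zeros_like(bones)
    bone_vel[:, 1:] = bones[:, 1:] - bones[:, :-1]
    return np.concatenate([joints, bones, joint_vel, bone_vel], axis=0)  # (4C, T, V)

def part_streams(joints: np.ndarray) -> dict:
    """Slice each part group out of the unified tensor. Coordinates stay in
    the global frame, so each part stream retains whole-body context."""
    unified = unified_modalities(joints)
    return {name: unified[:, :, idx] for name, idx in PART_GROUPS.items()}

streams = part_streams(np.random.randn(3, 64, 5))
print({k: v.shape for k, v in streams.items()})  # each part: (12, 64, |part|)
```

Because every part stream consumes all four modalities at once, one network per part replaces the four separately trained modality networks, which is where the parameter savings claimed in the abstract come from.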
Related papers
- Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition [57.97930719585095]
We introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales; a sketch of the alignment idea follows this entry.
Our approach is evaluated on various skeleton/language backbones and three large-scale datasets.
The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains.
arXiv Detail & Related papers (2024-06-19T08:22:32Z)
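A hedged sketch of what local-plus-global visual-semantic alignment can look like; the names, shapes, and scoring rule below are assumptions for illustration, not PURLS's actual interface.

```python
import torch
import torch.nn.functional as F

# Illustrative only: skeleton features are pooled globally and per body part,
# then matched to language embeddings by cosine similarity, so unseen classes
# can be scored zero-shot from their textual descriptions.
def zero_shot_scores(part_feats, text_emb, part_text_emb):
    # part_feats: (B, P, D) per-part skeleton features (assumed layout)
    # text_emb: (C, D) class-description embeddings; part_text_emb: (C, P, D)
    global_feat = F.normalize(part_feats.mean(dim=1), dim=-1)          # (B, D)
    g = global_feat @ F.normalize(text_emb, dim=-1).t()                # (B, C) global
    local = torch.einsum("bpd,cpd->bcp",
                         F.normalize(part_feats, dim=-1),
                         F.normalize(part_text_emb, dim=-1)).mean(-1)  # (B, C) local
    return g + local                                                   # combined score

scores = zero_shot_scores(torch.randn(2, 5, 64), torch.randn(10, 64),
                          torch.randn(10, 5, 64))
print(scores.shape)  # (2, 10): argmax over unseen classes gives the prediction
```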
- Explore Human Parsing Modality for Action Recognition [17.624946657761996]
We propose a new dual-branch framework called Ensemble Human Parsing and Pose Network (EPP-Net).
EPP-Net is the first to leverage both skeletons and human parsing modalities for action recognition.
arXiv Detail & Related papers (2024-01-04T08:43:41Z)
- Adaptive Local-Component-aware Graph Convolutional Network for One-shot Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition.
Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art performance.
arXiv Detail & Related papers (2022-09-21T02:33:07Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are available only on the source dataset and unavailable on the target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks (the temporal task is sketched after this entry).
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
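As a hedged illustration of the self-supervision scheme this entry describes, the temporal task can be framed as permutation classification; the segment count and sampling details below are assumptions, not the paper's exact setup.

```python
import itertools
import numpy as np

# Illustrative pretext task: cut a sequence into segments, shuffle them with
# one of K fixed permutations, and train a classifier to predict which
# permutation was applied. An analogous task permutes body-part joint groups.
PERMS = list(itertools.permutations(range(3)))  # K = 3! = 6 permutation classes

def temporal_cubism_sample(seq: np.ndarray):
    """seq: (T, ...) skeleton sequence. Returns (shuffled_seq, perm_label)."""
    segments = np.array_split(seq, 3, axis=0)
    label = np.random.randint(len(PERMS))
    shuffled = np.concatenate([segments[i] for i in PERMS[label]], axis=0)
    return shuffled, label

x, y = temporal_cubism_sample(np.random.randn(60, 25, 3))
print(x.shape, y)  # (60, 25, 3) and a permutation class in [0, 6)
```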
- FV-UPatches: Enhancing Universality in Finger Vein Recognition [0.6299766708197883]
We propose a universal learning-based framework, which achieves generalization while training with limited data.
The proposed framework shows application potential in other vein-based biometric recognition as well.
arXiv Detail & Related papers (2022-06-02T14:20:22Z)
- SpatioTemporal Focus for Skeleton-based Action Recognition [66.8571926307011]
Graph convolutional networks (GCNs) are widely adopted in skeleton-based action recognition.
We argue that the performance of recently proposed skeleton-based action recognition methods is limited by several factors.
Inspired by recent attention mechanisms, we propose a multi-grain contextual focus module, termed MCF, to capture action-associated relational information.
arXiv Detail & Related papers (2022-03-31T02:45:24Z)
- Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose-prediction-based auto-encoder in the self-supervised training stage allows the network to learn motion representations from unlabeled data (a sketch follows this entry).
arXiv Detail & Related papers (2022-02-08T16:03:15Z)
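A minimal sketch of the pose-prediction auto-encoder idea from the entry above, with a generic GRU standing in for the CD-JBF-GCN encoder; shapes, names, and the next-frame prediction target are assumptions, not the paper's interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PosePredictionAE(nn.Module):
    """Encoder + pose-prediction decoder, trainable on unlabeled skeleton
    clips by regressing the next frame's pose (illustrative stand-in)."""
    def __init__(self, joints=25, dim=128):
        super().__init__()
        self.encoder = nn.GRU(joints * 3, dim, batch_first=True)  # stand-in encoder
        self.decoder = nn.Linear(dim, joints * 3)                  # pose prediction head

    def forward(self, x):            # x: (B, T, joints*3) flattened skeletons
        h, _ = self.encoder(x)
        return self.decoder(h)       # predicted pose for the following frame

model = PosePredictionAE()
clip = torch.randn(4, 32, 75)                      # unlabeled skeleton clips
pred = model(clip[:, :-1])                         # predict frame t+1 from frames <= t
loss = F.mse_loss(pred, clip[:, 1:])               # self-supervised, no action labels
loss.backward()
```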
- UNIK: A Unified Framework for Real-world Skeleton-based Action Recognition [11.81043814295441]
We introduce UNIK, a novel skeleton-based action recognition method that is able to generalize across datasets.
To study the cross-domain generalizability of action recognition in real-world videos, we re-evaluate state-of-the-art approaches as well as the proposed UNIK.
Results show that the proposed UNIK, with pre-training on Posetics, generalizes well and outperforms the state of the art when transferred onto four target action classification datasets.
arXiv Detail & Related papers (2021-07-19T02:00:28Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method (the anchor-free formulation is sketched after this entry).
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
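The anchor-free idea above can be sketched as a per-frame head that regresses distances to the segment boundaries, a common anchor-free formulation; the layer sizes are assumptions, and the paper's saliency-based refinement module and consistency constraints are omitted.

```python
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """Instead of scoring preset anchor windows, every temporal position
    directly regresses its distances to the action start and end,
    plus per-frame class logits (illustrative sketch)."""
    def __init__(self, dim=256, num_classes=20):
        super().__init__()
        self.cls = nn.Conv1d(dim, num_classes, 1)  # per-frame class logits
        self.reg = nn.Conv1d(dim, 2, 1)            # (dist_to_start, dist_to_end)

    def forward(self, feats):                      # feats: (B, dim, T)
        logits = self.cls(feats)                   # (B, num_classes, T)
        offsets = torch.relu(self.reg(feats))      # non-negative distances
        t = torch.arange(feats.size(-1), device=feats.device).float()
        start = t - offsets[:, 0]                  # (B, T) predicted segment starts
        end = t + offsets[:, 1]                    # (B, T) predicted segment ends
        return logits, start, end

logits, start, end = AnchorFreeHead()(torch.randn(2, 256, 100))
print(logits.shape, start.shape, end.shape)  # (2, 20, 100) (2, 100) (2, 100)
```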
- JOLO-GCN: Mining Joint-Centered Light-Weight Information for Skeleton-Based Action Recognition [47.47099206295254]
We propose a novel framework for employing human pose skeleton and joint-centered light-weight information jointly in a two-stream graph convolutional network.
Compared to the pure skeleton-based baseline, this hybrid scheme effectively boosts performance, while keeping the computational and memory overheads low.
arXiv Detail & Related papers (2020-11-16T08:39:22Z)