Related papers: Representation-Centric Survey of Skeletal Action Recognition and the ANUBIS Benchmark

Representation-Centric Survey of Skeletal Action Recognition and the ANUBIS Benchmark

URL: http://arxiv.org/abs/2205.02071v6
Date: Sun, 14 Sep 2025 12:56:37 GMT
Title: Representation-Centric Survey of Skeletal Action Recognition and the ANUBIS Benchmark
Authors: Yang Liu, Jiyao Yang, Madhawa Perera, Pan Ji, Dongwoo Kim, Min Xu, Tianyang Wang, Saeed Anwar, Tom Gedeon, Lei Wang, Zhenyue Qin,
Abstract summary: 3D skeleton-based human action recognition has emerged as a powerful alternative to traditional RGB and depth-based approaches.<n>Despite remarkable progress, current research remains fragmented across diverse input representations.<n>ANUBIS is a large-scale, challenging skeleton action dataset designed to address critical gaps in existing benchmarks.
Score: 43.00059447663327
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: 3D skeleton-based human action recognition has emerged as a powerful alternative to traditional RGB and depth-based approaches, offering robustness to environmental variations, computational efficiency, and enhanced privacy. Despite remarkable progress, current research remains fragmented across diverse input representations and lacks evaluation under scenarios that reflect modern real-world challenges. This paper presents a representation-centric survey of skeleton-based action recognition, systematically categorizing state-of-the-art methods by their input feature types: joint coordinates, bone vectors, motion flows, and extended representations, and analyzing how these choices influence spatial-temporal modeling strategies. Building on the insights from this review, we introduce ANUBIS, a large-scale, challenging skeleton action dataset designed to address critical gaps in existing benchmarks. ANUBIS incorporates multi-view recordings with back-view perspectives, complex multi-person interactions, fine-grained and violent actions, and contemporary social behaviors. We benchmark a diverse set of state-of-the-art models on ANUBIS and conduct an in-depth analysis of how different feature types affect recognition performance across 102 action categories. Our results show strong action-feature dependencies, highlight the limitations of na\"ive multi-representational fusion, and point toward the need for task-aware, semantically aligned integration strategies. This work offers both a comprehensive foundation and a practical benchmarking resource, aiming to guide the next generation of robust, generalizable skeleton-based action recognition systems for complex real-world scenarios. The dataset website, benchmarking framework, and download link are available at https://yliu1082.github.io/ANUBIS/.

Related papers

3D Skeleton-Based Action Recognition: A Review [60.0580120274659]
3D skeleton-based action recognition has become a prominent topic in the field of computer vision.<n>Previous reviews have predominantly adopted a model-oriented perspective, often neglecting the fundamental steps involved in skeleton-based action recognition.<n>This review aims to address these limitations by presenting a comprehensive, task-oriented framework for understanding skeleton-based action recognition.
arXiv Detail & Related papers (2025-06-01T09:04:12Z)
Place Recognition Meet Multiple Modalitie: A Comprehensive Review, Current Challenges and Future Directions [2.4775350526606355]
We review recent advancements in place recognition, emphasizing three methodological paradigms.<n>CNN-based approaches, Transformer-based frameworks, and cross-modal strategies are discussed.<n>We identify current research challenges and outline prospective directions, including domain adaptation, real-time performance, and lifelong learning, to inspire future advancements in this domain.
arXiv Detail & Related papers (2025-05-20T08:16:37Z)
USDRL: Unified Skeleton-Based Dense Representation Learning with Multi-Grained Feature Decorrelation [24.90512145836643]
We introduce a Unified Skeleton-based Dense Representation Learning framework based on feature decorrelation.<n>We show that our approach significantly outperforms the current state-of-the-art (SOTA) approaches.
arXiv Detail & Related papers (2024-12-12T12:20:27Z)
Where Do We Stand with Implicit Neural Representations? A Technical and Performance Survey [16.89460694470542]
Implicit Neural Representations (INRs) have emerged as a paradigm in knowledge representation.<n>INRs leverage multilayer perceptrons (MLPs) to model data as continuous implicit functions.<n>This survey introduces a clear taxonomy that categorises them into four key areas: activation functions, position encoding, combined strategies, and network structure.
arXiv Detail & Related papers (2024-11-06T06:14:24Z)
Persistent Topological Features in Large Language Models [0.6597195879147556]
We introduce topological descriptors that measure how topological features, $p$-dimensional holes, persist and evolve throughout the layers.<n>This offers a statistical perspective on how prompts are rearranged and their relative positions changed in the representation space.<n>As a showcase application, we use zigzag persistence to establish a criterion for layer pruning, achieving results comparable to state-of-the-art methods.
arXiv Detail & Related papers (2024-10-14T19:46:23Z)
An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training. Previous research has focused on aligning sequences' visual and semantic spatial distributions. We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
SkeleTR: Towrads Skeleton-based Action Recognition in the Wild [86.03082891242698]
SkeleTR is a new framework for skeleton-based action recognition. It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions. It then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios.
arXiv Detail & Related papers (2023-09-20T16:22:33Z)
One-Shot Action Recognition via Multi-Scale Spatial-Temporal Skeleton Matching [77.6989219290789]
One-shot skeleton action recognition aims to learn a skeleton action recognition model with a single training sample. This paper presents a novel one-shot skeleton action recognition technique that handles skeleton action recognition via multi-scale spatial-temporal feature matching.
arXiv Detail & Related papers (2023-07-14T11:52:10Z)
Adaptive Local-Component-aware Graph Convolutional Network for One-shot Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition. Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art.
arXiv Detail & Related papers (2022-09-21T02:33:07Z)
Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage. We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets. By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
PYSKL: Towards Good Practices for Skeleton Action Recognition [77.87404524458809]
PYSKL is an open-source toolbox for skeleton-based action recognition based on PyTorch. It implements six different algorithms under a unified framework to ease the comparison of efficacy and efficiency. PYSKL supports the training and testing of nine skeleton-based action recognition benchmarks and achieves state-of-the-art recognition performance on eight of them.
arXiv Detail & Related papers (2022-05-19T09:58:32Z)
Towards a Deeper Understanding of Skeleton-based Gait Recognition [4.812321790984493]
In recent years, most gait recognition methods used the person's silhouette to extract the gait features. Model-based methods do not suffer from these problems and are able to represent the temporal motion of body joints. In this work, we propose an approach based on Graph Convolutional Networks (GCNs) that combines higher-order inputs, and residual networks.
arXiv Detail & Related papers (2022-04-16T18:23:37Z)
Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data. Our method achieves the best or competitive performance with the state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z)
UNIK: A Unified Framework for Real-world Skeleton-based Action Recognition [11.81043814295441]
We introduce UNIK, a novel skeleton-based action recognition method that is able to generalize across datasets. To study the cross-domain generalizability of action recognition in real-world videos, we re-evaluate state-of-the-art approaches as well as the proposed UNIK. Results show that the proposed UNIK, with pre-training on Posetics, generalizes well and outperforms state-of-the-art when transferred onto four target action classification datasets.
arXiv Detail & Related papers (2021-07-19T02:00:28Z)
Revisiting Skeleton-based Action Recognition [107.08112310075114]
PoseC3D is a new approach to skeleton-based action recognition, which relies on a 3D heatmap instead stack a graph sequence as the base representation of human skeletons. On four challenging datasets, PoseC3D consistently obtains superior performance, when used alone on skeletons and in combination with the RGB modality.
arXiv Detail & Related papers (2021-04-28T06:32:17Z)
Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to under-stand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels. We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects. The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z)
Unifying Graph Embedding Features with Graph Convolutional Networks for Skeleton-based Action Recognition [18.001693718043292]
We propose a novel framework, which unifies 15 graph embedding features into the graph convolutional network for human action recognition. Our model is validated by three large-scale datasets, namely NTU-RGB+D, Kinetics and SYSU-3D.
arXiv Detail & Related papers (2020-03-06T02:31:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.