Related papers: PRGCN: A Graph Memory Network for Cross-Sequence Pattern Reuse in 3D Human Pose Estimation

PRGCN: A Graph Memory Network for Cross-Sequence Pattern Reuse in 3D Human Pose Estimation

URL: http://arxiv.org/abs/2510.19475v1
Date: Wed, 22 Oct 2025 11:12:07 GMT
Title: PRGCN: A Graph Memory Network for Cross-Sequence Pattern Reuse in 3D Human Pose Estimation
Authors: Zhuoyang Xie, Yibo Zhao, Hui Huang, Riwei Wang, Zan Gao,
Abstract summary: This work introduces the Pattern Reuse Graph Conal Network (PRGCN), a novel framework that formalizes pose estimation as a problem of pattern retrieval and adaptation.<n>At its core, PRGCN features a graph memory bank that learns and stores a compact set of pose prototypes, encoded as relational graphs, which are dynamically retrieved via an attention mechanism to provide structured priors.<n>Our work posits that PRGCN establishes a new state-of-the-art, achieving an MPJPE of 37.1mm and 13.4mm, respectively, while exhibiting enhanced cross-domain generalization capability.
Score: 18.771349697842947
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Monocular 3D human pose estimation remains a fundamentally ill-posed inverse problem due to the inherent depth ambiguity in 2D-to-3D lifting. While contemporary video-based methods leverage temporal context to enhance spatial reasoning, they operate under a critical paradigm limitation: processing each sequence in isolation, thereby failing to exploit the strong structural regularities and repetitive motion patterns that pervade human movement across sequences. This work introduces the Pattern Reuse Graph Convolutional Network (PRGCN), a novel framework that formalizes pose estimation as a problem of pattern retrieval and adaptation. At its core, PRGCN features a graph memory bank that learns and stores a compact set of pose prototypes, encoded as relational graphs, which are dynamically retrieved via an attention mechanism to provide structured priors. These priors are adaptively fused with hard-coded anatomical constraints through a memory-driven graph convolution, ensuring geometrical plausibility. To underpin this retrieval process with robust spatiotemporal features, we design a dual-stream hybrid architecture that synergistically combines the linear-complexity, local temporal modeling of Mamba-based state-space models with the global relational capacity of self-attention. Extensive evaluations on Human3.6M and MPI-INF-3DHP benchmarks demonstrate that PRGCN establishes a new state-of-the-art, achieving an MPJPE of 37.1mm and 13.4mm, respectively, while exhibiting enhanced cross-domain generalization capability. Our work posits that the long-overlooked mechanism of cross-sequence pattern reuse is pivotal to advancing the field, shifting the paradigm from per-sequence optimization towards cumulative knowledge learning.

Related papers

Towards Geometry-Aware and Motion-Guided Video Human Mesh Recovery [60.51998732898099]
We introduce HMRMamba, a new paradigm for 3D Human Mesh Recovery.<n>It pioneers the use of Structured State Space Models for their efficiency and long-range modeling prowess.<n>Our framework is distinguished by two core contributions. First, the Geometry-Aware Lifting Module, featuring a novel dual-scan Mamba architecture.
arXiv Detail & Related papers (2026-01-29T08:05:02Z)
3D Human Pose Estimation via Spatial Graph Order Attention and Temporal Body Aware Transformer [5.303583360581161]
We propose a new method that exploits the graph modeling capability of GCN to represent each skeleton with multiple graphs of different orders.<n>The resulting spatial features of the sequence are processed using a proposed temporal Body Aware Transformer.<n>Experiments on Human3.6m, MPIINF-3DHP, and HumanEva-I datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2025-05-02T04:58:04Z)
STGFormer: Spatio-Temporal GraphFormer for 3D Human Pose Estimation in Video [7.345621536750547]
This paper presents the S-Temporal GraphFormer framework (STGFormer) for 3D human pose estimation in videos.<n>First, we introduce a STG attention mechanism, designed to more effectively leverage the inherent graph distributions of human body.<n>Next, we present a Modulated Hop-wise Regular GCN to independently process temporal and spatial dimensions in parallel.<n>Finally, we demonstrate our method state-of-the-art performance on the Human3.6M and MPIINF-3DHP datasets.
arXiv Detail & Related papers (2024-07-14T06:45:27Z)
Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach to exploit Transformer-temporal information with a concise Graph and Skipped Transformer architecture. Specifically, in 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model. Experiments are conducted on Human3.6M, MPI-INF-3DHP and Human-Eva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z)
Overcoming Topology Agnosticism: Enhancing Skeleton-Based Action Recognition through Redefined Skeletal Topology Awareness [24.83836008577395]
Graph Convolutional Networks (GCNs) have long defined the state-of-the-art in skeleton-based action recognition. They tend to optimize the adjacency matrix jointly with the model weights. This process causes a gradual decay of bone connectivity data, culminating in a model indifferent to the very topology it sought to map. We propose an innovative pathway that encodes bone connectivity by harnessing the power of graph distances.
arXiv Detail & Related papers (2023-05-19T06:40:12Z)
Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation [53.04781510348416]
Video-based 3D human pose and shape estimations are evaluated by intra-frame accuracy and inter-frame smoothness. We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, Global-to-Local Transformer (GLoT) Our GLoT surpasses previous state-of-the-art methods with the lowest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z)
Pose-Oriented Transformer with Uncertainty-Guided Refinement for 2D-to-3D Human Pose Estimation [51.00725889172323]
We propose a Pose-Oriented Transformer (POT) with uncertainty guided refinement for 3D human pose estimation. We first develop novel pose-oriented self-attention mechanism and distance-related position embedding for POT to explicitly exploit the human skeleton topology. We present an Uncertainty-Guided Refinement Network (UGRN) to refine pose predictions from POT, especially for the difficult joints.
arXiv Detail & Related papers (2023-02-15T00:22:02Z)
Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
gait recognition in the wild is a more practical problem that has attracted the attention of the community of multimedia and computer vision. This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
Back to MLP: A Simple Baseline for Human Motion Prediction [59.18776744541904]
This paper tackles the problem of human motion prediction, consisting in forecasting future body poses from historically observed sequences. We show that the performance of these approaches can be surpassed by a light-weight and purely architectural architecture with only 0.14M parameters. An exhaustive evaluation on Human3.6M, AMASS and 3DPW datasets shows that our method, which we dub siMLPe, consistently outperforms all other approaches.
arXiv Detail & Related papers (2022-07-04T16:35:58Z)
Enhanced 3D Human Pose Estimation from Videos by using Attention-Based Neural Network with Dilated Convolutions [12.900524511984798]
We show a systematic design for how conventional networks and other forms of constraints can be incorporated into the attention framework. We achieve this by adapting temporal receptive field via a multi-scale structure of dilated convolutions. Our method achieves the state-of-the-art performance and outperforms existing methods by reducing the mean per joint position error to 33.4 mm on Human3.6M dataset.
arXiv Detail & Related papers (2021-03-04T17:26:51Z)
On the spatial attention in Spatio-Temporal Graph Convolutional Networks for skeleton-based human action recognition [97.14064057840089]
Graphal networks (GCNs) promising performance in skeleton-based human action recognition by modeling a sequence of skeletons as a graph. Most of the recently proposed G-temporal-based methods improve the performance by learning the graph structure at each layer of the network.
arXiv Detail & Related papers (2020-11-07T19:03:04Z)
Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions and a unified spatial-temporal graph convolutional operator named G3D. By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
arXiv Detail & Related papers (2020-03-31T11:28:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.