STGFormer: Spatio-Temporal GraphFormer for 3D Human Pose Estimation in Video
- URL: http://arxiv.org/abs/2407.10099v2
- Date: Wed, 26 Feb 2025 07:56:48 GMT
- Title: STGFormer: Spatio-Temporal GraphFormer for 3D Human Pose Estimation in Video
- Authors: Yang Liu, Zhiyong Zhang,
- Abstract summary: This paper presents the S-Temporal GraphFormer framework (STGFormer) for 3D human pose estimation in videos.<n>First, we introduce a STG attention mechanism, designed to more effectively leverage the inherent graph distributions of human body.<n>Next, we present a Modulated Hop-wise Regular GCN to independently process temporal and spatial dimensions in parallel.<n>Finally, we demonstrate our method state-of-the-art performance on the Human3.6M and MPIINF-3DHP datasets.
- Score: 7.345621536750547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The current methods of video-based 3D human pose estimation have achieved significant progress.However, they still face pressing challenges, such as the underutilization of spatiotemporal bodystructure features in transformers and the inadequate granularity of spatiotemporal interaction modeling in graph convolutional networks, which leads to pervasive depth ambiguity in monocular 3D human pose estimation. To address these limitations, this paper presents the Spatio-Temporal GraphFormer framework (STGFormer) for 3D human pose estimation in videos. First, we introduce a Spatio-Temporal criss-cross Graph (STG) attention mechanism, designed to more effectively leverage the inherent graph priors of the human body within continuous sequence distributions while capturing spatiotemporal long-range dependencies. Next, we present a dual-path Modulated Hop-wise Regular GCN (MHR-GCN) to independently process temporal and spatial dimensions in parallel, preserving features rich in temporal dynamics and the original or high-dimensional representations of spatial structures. Furthermore, the module leverages modulation to optimize parameter efficiency and incorporates spatiotemporal hop-wise skip connections to capture higher-order information. Finally, we demonstrate that our method achieves state-of-the-art performance on the Human3.6M and MPIINF-3DHP datasets.
Related papers
- GAST: Sequential Gaussian Avatars with Hierarchical Spatio-temporal Context [7.6736633105043515]
3D human avatars, through the use of canonical radiance fields and per-frame observed warping, enable high-fidelity rendering and animating.
Existing methods, which rely on either spatial SMPL(-X) poses or temporal embeddings, respectively suffer from coarse quality or limited animation flexibility.
We propose GAST, a framework that unifies 3D human modeling with 3DGS by hierarchically integrating both spatial and temporal information.
arXiv Detail & Related papers (2024-11-25T04:05:19Z) - Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach to exploit Transformer-temporal information with a concise Graph and Skipped Transformer architecture.
Specifically, in 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model.
Experiments are conducted on Human3.6M, MPI-INF-3DHP and Human-Eva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z) - UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z) - Unsupervised 3D Pose Estimation with Non-Rigid Structure-from-Motion
Modeling [83.76377808476039]
We propose a new modeling method for human pose deformations and design an accompanying diffusion-based motion prior.
Inspired by the field of non-rigid structure-from-motion, we divide the task of reconstructing 3D human skeletons in motion into the estimation of a 3D reference skeleton.
A mixed spatial-temporal NRSfMformer is used to simultaneously estimate the 3D reference skeleton and the skeleton deformation of each frame from 2D observations sequence.
arXiv Detail & Related papers (2023-08-18T16:41:57Z) - Spatio-temporal Tendency Reasoning for Human Body Pose and Shape
Estimation from Videos [10.50306784245168]
We present atemporal tendency reasoning (STR) network for recovering human body pose shape from videos.
Our STR aims to learn accurate and spatial motion sequences in an unconstrained environment.
Our STR remains competitive with the state-of-the-art on three datasets.
arXiv Detail & Related papers (2022-10-07T16:09:07Z) - Multivariate Time Series Forecasting with Dynamic Graph Neural ODEs [65.18780403244178]
We propose a continuous model to forecast Multivariate Time series with dynamic Graph neural Ordinary Differential Equations (MTGODE)
Specifically, we first abstract multivariate time series into dynamic graphs with time-evolving node features and unknown graph structures.
Then, we design and solve a neural ODE to complement missing graph topologies and unify both spatial and temporal message passing.
arXiv Detail & Related papers (2022-02-17T02:17:31Z) - Graph-Based 3D Multi-Person Pose Estimation Using Multi-View Images [79.70127290464514]
We decompose the task into two stages, i.e. person localization and pose estimation.
And we propose three task-specific graph neural networks for effective message passing.
Our approach achieves state-of-the-art performance on CMU Panoptic and Shelf datasets.
arXiv Detail & Related papers (2021-09-13T11:44:07Z) - Multiscale Spatio-Temporal Graph Neural Networks for 3D Skeleton-Based
Motion Prediction [92.16318571149553]
We propose a multiscale-temporal graph neural network (MST-GNN) to predict the future 3D-based skeleton human poses.
The MST-GNN outperforms state-of-the-art methods in both short and long-term motion prediction.
arXiv Detail & Related papers (2021-08-25T14:05:37Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z) - Enhanced 3D Human Pose Estimation from Videos by using Attention-Based
Neural Network with Dilated Convolutions [12.900524511984798]
We show a systematic design for how conventional networks and other forms of constraints can be incorporated into the attention framework.
We achieve this by adapting temporal receptive field via a multi-scale structure of dilated convolutions.
Our method achieves the state-of-the-art performance and outperforms existing methods by reducing the mean per joint position error to 33.4 mm on Human3.6M dataset.
arXiv Detail & Related papers (2021-03-04T17:26:51Z) - On the spatial attention in Spatio-Temporal Graph Convolutional Networks
for skeleton-based human action recognition [97.14064057840089]
Graphal networks (GCNs) promising performance in skeleton-based human action recognition by modeling a sequence of skeletons as a graph.
Most of the recently proposed G-temporal-based methods improve the performance by learning the graph structure at each layer of the network.
arXiv Detail & Related papers (2020-11-07T19:03:04Z) - Disentangling and Unifying Graph Convolutions for Skeleton-Based Action
Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions and a unified spatial-temporal graph convolutional operator named G3D.
By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
arXiv Detail & Related papers (2020-03-31T11:28:25Z) - A Graph Attention Spatio-temporal Convolutional Network for 3D Human
Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in human skeleton by modeling local global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.