HDFormer: High-order Directed Transformer for 3D Human Pose Estimation
- URL: http://arxiv.org/abs/2302.01825v2
- Date: Mon, 22 May 2023 06:32:17 GMT
- Title: HDFormer: High-order Directed Transformer for 3D Human Pose Estimation
- Authors: Hanyuan Chen, Jun-Yan He, Wangmeng Xiang, Zhi-Qi Cheng, Wei Liu,
Hanbing Liu, Bin Luo, Yifeng Geng, Xuansong Xie
- Abstract summary: HDFormer significantly outperforms state-of-the-art (SOTA) models on Human3.6M and MPI-INF-3DHP datasets.
HDFormer demonstrates broad real-world applicability, enabling real-time, accurate 3D pose estimation.
- Score: 20.386530242069338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human pose estimation is a challenging task due to the structured,
sequential nature of its data. Existing methods primarily focus on the pair-wise interaction of
body joints, which is insufficient for scenarios involving overlapping joints
and rapidly changing poses. To overcome these issues, we introduce a novel
approach, the High-order Directed Transformer (HDFormer), which leverages
high-order bone and joint relationships for improved pose estimation.
Specifically, HDFormer incorporates both self-attention and high-order
attention to formulate a multi-order attention module. This module facilitates
first-order "joint$\leftrightarrow$joint", second-order
"bone$\leftrightarrow$joint", and high-order "hyperbone$\leftrightarrow$joint"
interactions, effectively addressing issues in complex and occlusion-heavy
situations. In addition, modern CNN techniques are integrated into the
transformer-based architecture, balancing the trade-off between performance and
efficiency. HDFormer significantly outperforms state-of-the-art (SOTA) models
on Human3.6M and MPI-INF-3DHP datasets, requiring only 1/10 of the parameters
and significantly lower computational costs. Moreover, HDFormer demonstrates
broad real-world applicability, enabling real-time, accurate 3D pose
estimation. The source code is available at https://github.com/hyer/HDFormer.
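The multi-order attention module lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch rendering of the first-order "joint↔joint" and second-order "bone↔joint" terms; the skeleton topology, layer sizes, and the way bones are derived from joints are illustrative assumptions, not the authors' implementation (hyperbone terms are only indicated in a comment).

```python
# Illustrative sketch (not the authors' code) of a multi-order attention block
# in the spirit of HDFormer: joints attend to joints (first order) and to bone
# features derived from joint pairs (second order).
import torch
import torch.nn as nn

class MultiOrderAttention(nn.Module):
    def __init__(self, dim, num_heads=4, parents=(-1, 0, 1, 2, 0, 4, 5)):
        super().__init__()
        num_joints = len(parents)
        bones = [(j, p) for j, p in enumerate(parents) if p >= 0]
        # Signed incidence matrix: each directed bone is child minus parent.
        inc = torch.zeros(len(bones), num_joints)
        for b, (j, p) in enumerate(bones):
            inc[b, j], inc[b, p] = 1.0, -1.0
        self.register_buffer("incidence", inc)
        # First-order "joint <-> joint" self-attention.
        self.joint_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Second-order "bone <-> joint" cross-attention (joints query bones).
        self.bone_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, joints):                    # joints: (B, J, C)
        bones = self.incidence @ joints           # (B, Nb, C) directed bone features
        j2j, _ = self.joint_attn(joints, joints, joints)
        b2j, _ = self.bone_attn(joints, bones, bones)
        # A full high-order block would add "hyperbone <-> joint" terms built
        # from chains of bones; omitted here to keep the sketch minimal.
        return self.norm(joints + j2j + b2j)

x = torch.randn(2, 7, 64)                         # batch of 7-joint skeletons
print(MultiOrderAttention(dim=64)(x).shape)       # torch.Size([2, 7, 64])
```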
Related papers
- Enhancing 3D Human Pose Estimation Amidst Severe Occlusion with Dual Transformer Fusion [13.938406073551844]
This paper introduces the Dual Transformer Fusion (DTF) algorithm, a novel approach for obtaining holistic 3D pose estimates.
To enable precise 3D Human Pose Estimation, our approach leverages the innovative DTF architecture, which first generates a pair of intermediate views.
Our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements.
arXiv Detail & Related papers (2024-10-06T18:15:27Z)
- Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach that exploits spatio-temporal information with a concise Graph and Skipped Transformer architecture.
Specifically, in the 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model.
Experiments are conducted on the Human3.6M, MPI-INF-3DHP, and HumanEva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z)
- DICE: End-to-end Deformation Capture of Hand-Face Interactions from a Single Image [98.29284902879652]
We present DICE, the first end-to-end method for Deformation-aware hand-face Interaction reCovEry from a single image.
It disentangles the regression of local deformation fields and global mesh locations into two network branches.
It achieves state-of-the-art performance on a standard benchmark and in-the-wild data in terms of accuracy and physical plausibility.
arXiv Detail & Related papers (2024-06-26T00:08:29Z)
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation [19.028127284305224]
We propose PoseFormerV2, which exploits a compact representation of lengthy skeleton sequences in the frequency domain to efficiently scale up the receptive field.
With minimal modifications to PoseFormer, the proposed method effectively fuses features in both the time and frequency domains, enjoying a better speed-accuracy trade-off than its precursor.
arXiv Detail & Related papers (2023-03-30T15:45:51Z)
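As a rough illustration of the frequency-domain idea above, the sketch below compresses a long 2D keypoint sequence by keeping only its low-frequency DCT coefficients; the sequence length, joint count, and number of retained coefficients are assumptions for illustration, not PoseFormerV2's actual configuration.

```python
# Hedged illustration of a frequency-domain compact representation:
# keep only low-frequency DCT coefficients of a long keypoint sequence.
import numpy as np
from scipy.fft import dct, idct

T, J = 243, 17                        # frames and joints: assumed sizes
seq = np.random.randn(T, J, 2)        # stand-in for a 2D keypoint sequence

coeffs = dct(seq, axis=0, norm="ortho")    # temporal DCT per joint/coordinate
n_keep = 27                                # retain low-frequency terms only
compact = coeffs[:n_keep]                  # (27, 17, 2): compact representation

# Zero-pad the truncated spectrum and invert to see what information survives.
padded = np.concatenate([compact, np.zeros((T - n_keep, J, 2))], axis=0)
restored = idct(padded, axis=0, norm="ortho")
# Real pose trajectories are smooth, so most energy sits in low frequencies and
# the reconstruction error stays small (random data here is just a shape check).
print(np.abs(restored - seq).mean())
```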
- HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation [22.648409352844997]
We propose Hierarchical Spatial-Temporal transFormers (HSTFormer) to gradually capture multi-level spatial-temporal correlations among joints, from local to global, for accurate 3D human pose estimation.
HSTFormer consists of four transformer encoders (TEs) and a fusion module. To the best of our knowledge, HSTFormer is the first to study hierarchical TEs with multi-level fusion.
It surpasses recent SOTA methods on the challenging MPI-INF-3DHP dataset and the small-scale HumanEva dataset with a highly generalized, systematic approach.
arXiv Detail & Related papers (2023-01-18T05:54:02Z)
- AdaptivePose++: A Powerful Single-Stage Network for Multi-Person Pose Regression [66.39539141222524]
We propose to represent human parts as adaptive points and introduce a fine-grained body representation method.
With the proposed body representation, we deliver a compact single-stage multi-person pose regression network, termed as AdaptivePose.
We employ AdaptivePose on both 2D and 3D multi-person pose estimation tasks to verify its effectiveness.
arXiv Detail & Related papers (2022-10-08T12:54:20Z)
- LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object-detection and classification models, which take image or patch features as input, LocATe's transformer model captures long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z)
- 3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos.
Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure.
We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
arXiv Detail & Related papers (2021-03-18T18:14:37Z)
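Since the spatial-then-temporal factorization described for PoseFormer recurs throughout this list, here is a minimal, hypothetical sketch of that structure; the layer counts, dimensions, and pooling are assumptions, not the released architecture.

```python
# A minimal sketch (not the released code) of a spatial-temporal transformer:
# a spatial encoder relates joints within each frame, then a temporal encoder
# relates frames; all sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class SpatialTemporalPose(nn.Module):
    def __init__(self, num_joints=17, dim=32):
        super().__init__()
        self.embed = nn.Linear(2, dim)        # 2D keypoint -> joint token
        spatial = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        temporal = nn.TransformerEncoderLayer(num_joints * dim, 4, batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial, num_layers=2)
        self.temporal = nn.TransformerEncoder(temporal, num_layers=2)
        self.head = nn.Linear(num_joints * dim, num_joints * 3)

    def forward(self, x):                      # x: (B, T, J, 2)
        B, T, J, _ = x.shape
        tokens = self.embed(x).flatten(0, 1)   # (B*T, J, dim): per-frame joint tokens
        tokens = self.spatial(tokens)          # joint <-> joint attention per frame
        frames = tokens.reshape(B, T, -1)      # (B, T, J*dim): one token per frame
        frames = self.temporal(frames)         # frame <-> frame attention
        # The real model regresses the center frame; mean-pooling simplifies.
        return self.head(frames.mean(1)).reshape(B, J, 3)

x = torch.randn(2, 81, 17, 2)                  # 81-frame 2D keypoint clips
print(SpatialTemporalPose()(x).shape)          # torch.Size([2, 17, 3])
```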
- HMOR: Hierarchical Multi-Person Ordinal Relations for Monocular Multi-Person 3D Pose Estimation [54.23770284299979]
This paper introduces a novel form of supervision: Hierarchical Multi-person Ordinal Relations (HMOR).
HMOR encodes interaction information as the ordinal relations of depths and angles hierarchically.
An integrated top-down model is designed to leverage these ordinal relations in the learning process.
The proposed method significantly outperforms state-of-the-art methods on publicly available multi-person 3D pose datasets.
arXiv Detail & Related papers (2020-08-01T07:53:27Z)
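To make the notion of ordinal depth relations concrete, below is a hedged sketch of a pairwise ranking-style loss over per-person depths; HMOR's actual formulation is hierarchical and also covers angles, so the margin and pairing scheme here are illustrative assumptions only.

```python
# Hedged sketch of an ordinal depth-relation loss in the spirit of HMOR:
# the supervision checks *which* person is farther, not exact depth values.
import torch

def ordinal_depth_loss(pred_z, gt_z, margin=0.1):
    """pred_z, gt_z: (N,) root depths for the N people in one image."""
    dz_pred = pred_z[:, None] - pred_z[None, :]           # pairwise predicted gaps
    sign_gt = torch.sign(gt_z[:, None] - gt_z[None, :])   # ground-truth ordering
    # Penalize pairs whose predicted gap disagrees with the true ordering.
    loss = torch.relu(margin - sign_gt * dz_pred)
    return loss[sign_gt != 0].mean()

pred = torch.tensor([2.0, 3.5, 3.0], requires_grad=True)
gt = torch.tensor([2.1, 3.2, 4.0])        # person 2 is actually farthest
print(ordinal_depth_loss(pred, gt))       # positive: persons 1 and 2 are mis-ordered
```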