HDFormer: High-order Directed Transformer for 3D Human Pose Estimation
- URL: http://arxiv.org/abs/2302.01825v2
- Date: Mon, 22 May 2023 06:32:17 GMT
- Title: HDFormer: High-order Directed Transformer for 3D Human Pose Estimation
- Authors: Hanyuan Chen, Jun-Yan He, Wangmeng Xiang, Zhi-Qi Cheng, Wei Liu,
Hanbing Liu, Bin Luo, Yifeng Geng, Xuansong Xie
- Abstract summary: HDFormer significantly outperforms state-of-the-art (SOTA) models on Human3.6M and MPI-INF-3DHP datasets.
HDFormer demonstrates broad real-world applicability, enabling real-time, accurate 3D pose estimation.
- Score: 20.386530242069338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human pose estimation is a challenging task due to the structured,
sequential nature of its data. Existing methods primarily focus on the pair-wise interaction of
body joints, which is insufficient for scenarios involving overlapping joints
and rapidly changing poses. To overcome these issues, we introduce a novel
approach, the High-order Directed Transformer (HDFormer), which leverages
high-order bone and joint relationships for improved pose estimation.
Specifically, HDFormer incorporates both self-attention and high-order
attention to formulate a multi-order attention module. This module facilitates
first-order "joint$\leftrightarrow$joint", second-order
"bone$\leftrightarrow$joint", and high-order "hyperbone$\leftrightarrow$joint"
interactions, effectively addressing issues in complex and occlusion-heavy
situations. In addition, modern CNN techniques are integrated into the
transformer-based architecture, balancing the trade-off between performance and
efficiency. HDFormer significantly outperforms state-of-the-art (SOTA) models
on Human3.6M and MPI-INF-3DHP datasets, requiring only 1/10 of the parameters
and significantly lower computational costs. Moreover, HDFormer demonstrates
broad real-world applicability, enabling real-time, accurate 3D pose
estimation. The source code is available at https://github.com/hyer/HDFormer.
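The multi-order attention module lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch rendering of the first-order "joint↔joint" and second-order "bone↔joint" terms; the skeleton topology, layer sizes, and the way bones are derived from joints are illustrative assumptions, not the authors' implementation (hyperbone terms are only indicated in a comment).

```python
# Illustrative sketch (not the authors' code) of a multi-order attention block
# in the spirit of HDFormer: joints attend to joints (first order) and to bone
# features derived from joint pairs (second order).
import torch
import torch.nn as nn

class MultiOrderAttention(nn.Module):
    def __init__(self, dim, num_heads=4, parents=(-1, 0, 1, 2, 0, 4, 5)):
        super().__init__()
        num_joints = len(parents)
        bones = [(j, p) for j, p in enumerate(parents) if p >= 0]
        # Signed incidence matrix: each directed bone is child minus parent.
        inc = torch.zeros(len(bones), num_joints)
        for b, (j, p) in enumerate(bones):
            inc[b, j], inc[b, p] = 1.0, -1.0
        self.register_buffer("incidence", inc)
        # First-order "joint <-> joint" self-attention.
        self.joint_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Second-order "bone <-> joint" cross-attention (joints query bones).
        self.bone_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, joints):                    # joints: (B, J, C)
        bones = self.incidence @ joints           # (B, Nb, C) directed bone features
        j2j, _ = self.joint_attn(joints, joints, joints)
        b2j, _ = self.bone_attn(joints, bones, bones)
        # A full high-order block would add "hyperbone <-> joint" terms built
        # from chains of bones; omitted here to keep the sketch minimal.
        return self.norm(joints + j2j + b2j)

x = torch.randn(2, 7, 64)                         # batch of 7-joint skeletons
print(MultiOrderAttention(dim=64)(x).shape)       # torch.Size([2, 7, 64])
```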
Related papers
- Enhancing 3D Human Pose Estimation Amidst Severe Occlusion with Dual Transformer Fusion [13.938406073551844]
This paper introduces the Dual Transformer Fusion (DTF) algorithm, a novel approach for obtaining holistic 3D pose estimates.
To enable precise 3D Human Pose Estimation, our approach leverages the innovative DTF architecture, which first generates a pair of intermediate views.
Our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements.
arXiv Detail & Related papers (2024-10-06T18:15:27Z)
- Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach that exploits spatio-temporal information with a concise Graph and Skipped Transformer architecture.
Specifically, in the 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model.
Experiments are conducted on the Human3.6M, MPI-INF-3DHP, and HumanEva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z)
- DICE: End-to-end Deformation Capture of Hand-Face Interactions from a Single Image [98.29284902879652]
We present DICE, the first end-to-end method for Deformation-aware hand-face Interaction reCovEry from a single image.
It disentangles the regression of local deformation fields and global mesh locations into two network branches.
It achieves state-of-the-art performance on a standard benchmark and in-the-wild data in terms of accuracy and physical plausibility.
arXiv Detail & Related papers (2024-06-26T00:08:29Z)
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation [19.028127284305224]
We propose PoseFormerV2, which exploits a compact representation of lengthy skeleton sequences in the frequency domain to efficiently scale up the receptive field.
With minimal modifications to PoseFormer, the proposed method effectively fuses features in both the time and frequency domains, enjoying a better speed-accuracy trade-off than its precursor.
arXiv Detail & Related papers (2023-03-30T15:45:51Z)
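As a rough illustration of the frequency-domain idea above, the sketch below compresses a long 2D keypoint sequence by keeping only its low-frequency DCT coefficients; the sequence length, joint count, and number of retained coefficients are assumptions for illustration, not PoseFormerV2's actual configuration.

```python
# Hedged illustration of a frequency-domain compact representation:
# keep only low-frequency DCT coefficients of a long keypoint sequence.
import numpy as np
from scipy.fft import dct, idct

T, J = 243, 17                        # frames and joints: assumed sizes
seq = np.random.randn(T, J, 2)        # stand-in for a 2D keypoint sequence

coeffs = dct(seq, axis=0, norm="ortho")    # temporal DCT per joint/coordinate
n_keep = 27                                # retain low-frequency terms only
compact = coeffs[:n_keep]                  # (27, 17, 2): compact representation

# Zero-pad the truncated spectrum and invert to see what information survives.
padded = np.concatenate([compact, np.zeros((T - n_keep, J, 2))], axis=0)
restored = idct(padded, axis=0, norm="ortho")
# Real pose trajectories are smooth, so most energy sits in low frequencies and
# the reconstruction error stays small (random data here is just a shape check).
print(np.abs(restored - seq).mean())
```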
- HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation [22.648409352844997]
We propose Hierarchical Spatial-Temporal transFormers (HSTFormer) to gradually capture multi-level spatial-temporal correlations among joints, from local to global, for accurate 3D human pose estimation.
HSTFormer consists of four transformer encoders (TEs) and a fusion module. To the best of our knowledge, HSTFormer is the first to study hierarchical TEs with multi-level fusion.
It surpasses recent SOTA methods on the challenging MPI-INF-3DHP dataset and the small-scale HumanEva dataset with a highly generalized, systematic approach.
arXiv Detail & Related papers (2023-01-18T05:54:02Z)
- AdaptivePose++: A Powerful Single-Stage Network for Multi-Person Pose Regression [66.39539141222524]
We propose to represent human parts as adaptive points and introduce a fine-grained body representation method.
With the proposed body representation, we deliver a compact single-stage multi-person pose regression network, termed as AdaptivePose.
We employ AdaptivePose on both 2D and 3D multi-person pose estimation tasks to verify its effectiveness.
arXiv Detail & Related papers (2022-10-08T12:54:20Z)
- LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object-detection and classification models, which take image or patch features as input, LocATe's transformer model captures long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z)
- 3D Human Pose Estimation with Spatial and Temporal Transformers [59.433208652418976]
We present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos.
Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure.
We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets.
arXiv Detail & Related papers (2021-03-18T18:14:37Z)
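Since the spatial-then-temporal factorization described for PoseFormer recurs throughout this list, here is a minimal, hypothetical sketch of that structure; the layer counts, dimensions, and pooling are assumptions, not the released architecture.

```python
# A minimal sketch (not the released code) of a spatial-temporal transformer:
# a spatial encoder relates joints within each frame, then a temporal encoder
# relates frames; all sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class SpatialTemporalPose(nn.Module):
    def __init__(self, num_joints=17, dim=32):
        super().__init__()
        self.embed = nn.Linear(2, dim)        # 2D keypoint -> joint token
        spatial = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        temporal = nn.TransformerEncoderLayer(num_joints * dim, 4, batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial, num_layers=2)
        self.temporal = nn.TransformerEncoder(temporal, num_layers=2)
        self.head = nn.Linear(num_joints * dim, num_joints * 3)

    def forward(self, x):                      # x: (B, T, J, 2)
        B, T, J, _ = x.shape
        tokens = self.embed(x).flatten(0, 1)   # (B*T, J, dim): per-frame joint tokens
        tokens = self.spatial(tokens)          # joint <-> joint attention per frame
        frames = tokens.reshape(B, T, -1)      # (B, T, J*dim): one token per frame
        frames = self.temporal(frames)         # frame <-> frame attention
        # The real model regresses the center frame; mean-pooling simplifies.
        return self.head(frames.mean(1)).reshape(B, J, 3)

x = torch.randn(2, 81, 17, 2)                  # 81-frame 2D keypoint clips
print(SpatialTemporalPose()(x).shape)          # torch.Size([2, 17, 3])
```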
- HMOR: Hierarchical Multi-Person Ordinal Relations for Monocular Multi-Person 3D Pose Estimation [54.23770284299979]
This paper introduces a novel form of supervision: Hierarchical Multi-person Ordinal Relations (HMOR).
HMOR encodes interaction information as the ordinal relations of depths and angles hierarchically.
An integrated top-down model is designed to leverage these ordinal relations in the learning process.
The proposed method significantly outperforms state-of-the-art methods on publicly available multi-person 3D pose datasets.
arXiv Detail & Related papers (2020-08-01T07:53:27Z)
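To make the notion of ordinal depth relations concrete, below is a hedged sketch of a pairwise ranking-style loss over per-person depths; HMOR's actual formulation is hierarchical and also covers angles, so the margin and pairing scheme here are illustrative assumptions only.

```python
# Hedged sketch of an ordinal depth-relation loss in the spirit of HMOR:
# the supervision checks *which* person is farther, not exact depth values.
import torch

def ordinal_depth_loss(pred_z, gt_z, margin=0.1):
    """pred_z, gt_z: (N,) root depths for the N people in one image."""
    dz_pred = pred_z[:, None] - pred_z[None, :]           # pairwise predicted gaps
    sign_gt = torch.sign(gt_z[:, None] - gt_z[None, :])   # ground-truth ordering
    # Penalize pairs whose predicted gap disagrees with the true ordering.
    loss = torch.relu(margin - sign_gt * dz_pred)
    return loss[sign_gt != 0].mean()

pred = torch.tensor([2.0, 3.5, 3.0], requires_grad=True)
gt = torch.tensor([2.1, 3.2, 4.0])        # person 2 is actually farthest
print(ordinal_depth_loss(pred, gt))       # positive: persons 1 and 2 are mis-ordered
```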