Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose
Estimation
- URL: http://arxiv.org/abs/2109.02303v1
- Date: Mon, 6 Sep 2021 09:06:17 GMT
- Authors: Ziniu Wan, Zhengjia Li, Maoqing Tian, Jianbo Liu, Shuai Yi, Hongsheng
Li
- Abstract summary: We propose the Multi-level Attention Encoder-Decoder Network (MAED) to model multi-level attention in a unified framework.
With the training set of 3DPW, MAED outperforms previous state-of-the-art methods by 6.2, 7.2, and 2.4 mm of PA-MPJPE on 3DPW, MPI-INF-3DHP, and Human3.6M respectively.
- Score: 61.98690211671168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D human shape and pose estimation is an essential task for human motion
analysis, which is widely used in many 3D applications. However, existing
methods cannot simultaneously capture the relations at multiple levels,
including the spatial-temporal level and the human joint level. Therefore, they fail to
make accurate predictions in hard scenarios with cluttered
background, occlusion, or extreme poses. To this end, we propose the Multi-level
Attention Encoder-Decoder Network (MAED), which includes a Spatial-Temporal Encoder
(STE) and a Kinematic Topology Decoder (KTD), to model multi-level attention in
a unified framework. STE consists of a series of cascaded blocks based on
Multi-Head Self-Attention, and each block uses two parallel branches to learn
spatial and temporal attention respectively. Meanwhile, KTD models
joint-level attention. It treats pose estimation as a top-down
hierarchical process similar to the SMPL kinematic tree. With the training set of
3DPW, MAED outperforms previous state-of-the-art methods by 6.2, 7.2, and 2.4
mm of PA-MPJPE on the three widely used benchmarks 3DPW, MPI-INF-3DHP, and
Human3.6M respectively. Our code is available at
https://github.com/ziniuwan/maed.
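The top-down hierarchical process that the abstract attributes to KTD can be illustrated with a small traversal sketch. This is not the authors' implementation: the abstract does not say how KTD conditions each joint on its ancestors, so the `predict` callback is a hypothetical placeholder, and the parent table is the commonly used 24-joint SMPL ordering, assumed here rather than taken from the source.

```python
from typing import Callable, Dict, List, Sequence

# Parent index of each of the 24 SMPL body joints (-1 marks the root, the pelvis).
# Assumption: commonly used SMPL joint ordering; not given in the abstract.
SMPL_PARENTS: List[int] = [-1, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8,
                           9, 9, 9, 12, 13, 14, 16, 17, 18, 19, 20, 21]


def ancestors(joint: int, parents: Sequence[int] = SMPL_PARENTS) -> List[int]:
    """Return the ancestors of `joint`, ordered from the root down to its parent."""
    chain = []
    p = parents[joint]
    while p != -1:
        chain.append(p)
        p = parents[p]
    return chain[::-1]


def top_down_order(parents: Sequence[int] = SMPL_PARENTS) -> List[int]:
    """Order joints by tree depth so every joint comes after all its ancestors."""
    return sorted(range(len(parents)),
                  key=lambda j: (len(ancestors(j, parents)), j))


def decode_top_down(predict: Callable[[int, list], object],
                    parents: Sequence[int] = SMPL_PARENTS) -> Dict[int, object]:
    """Visit joints root-first; `predict` (hypothetical) sees the ancestors' outputs."""
    outputs: Dict[int, object] = {}
    for j in top_down_order(parents):
        outputs[j] = predict(j, [outputs[a] for a in ancestors(j, parents)])
    return outputs


# Toy usage: the "prediction" for a joint is simply how many ancestors it has.
depths = decode_top_down(lambda joint, ancestor_outputs: len(ancestor_outputs))
```

In a real decoder, `predict` would be a learned module that estimates a joint's rotation from image features together with the outputs already produced for its ancestors; the traversal order only guarantees that those ancestor outputs exist when the joint is processed.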
Related papers
- Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics.
We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z)
- Geometry-Biased Transformer for Robust Multi-View 3D Human Pose Reconstruction [3.069335774032178]
We propose a novel encoder-decoder Transformer architecture to estimate 3D poses from multi-view 2D pose sequences.
We conduct experiments on three public benchmark datasets: Human3.6M, CMU Panoptic, and Occlusion-Persons.
arXiv Detail & Related papers (2023-12-28T16:30:05Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training [56.81809311892475]
Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers.
We propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds.
arXiv Detail & Related papers (2022-05-28T11:22:53Z)
- P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation [78.83305967085413]
This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for the 2D-to-3D human pose estimation task.
Our method outperforms state-of-the-art methods with fewer parameters and less computational overhead.
arXiv Detail & Related papers (2022-03-15T04:00:59Z)
- Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo [71.59494156155309]
Existing approaches for multi-view 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views.
We present our multi-view 3D pose estimation approach based on plane sweep stereo to jointly address the cross-view fusion and 3D pose reconstruction in a single shot.
arXiv Detail & Related papers (2021-04-06T03:49:35Z)
- Unsupervised Cross-Modal Alignment for Multi-Person 3D Pose Estimation [52.94078950641959]
We present a deployment-friendly, fast bottom-up framework for multi-person 3D human pose estimation.
We adopt a novel neural representation of multi-person 3D pose which unifies the position of person instances with their corresponding 3D pose representation.
We propose a practical deployment paradigm where paired 2D or 3D pose annotations are unavailable.
arXiv Detail & Related papers (2020-08-04T07:54:25Z)