Enhanced 3D Human Pose Estimation from Videos by using Attention-Based
Neural Network with Dilated Convolutions
- URL: http://arxiv.org/abs/2103.03170v1
- Date: Thu, 4 Mar 2021 17:26:51 GMT
- Title: Enhanced 3D Human Pose Estimation from Videos by using Attention-Based
Neural Network with Dilated Convolutions
- Authors: Ruixu Liu, Ju Shen, He Wang, Chen Chen, Sen-ching Cheung, Vijayan K.
Asari
- Abstract summary: We show a systematic design for how conventional networks and other forms of constraints can be incorporated into the attention framework.
We achieve this by adapting temporal receptive field via a multi-scale structure of dilated convolutions.
Our method achieves state-of-the-art performance and outperforms existing methods, reducing the mean per joint position error to 33.4 mm on the Human3.6M dataset.
- Score: 12.900524511984798
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The attention mechanism provides a sequential prediction framework for
learning spatial models with enhanced implicit temporal consistency. In this
work, we show a systematic design (from 2D to 3D) for how conventional networks
and other forms of constraints can be incorporated into the attention framework
for learning long-range dependencies for the task of pose estimation. The
contribution of this paper is to provide a systematic approach for designing
and training attention-based models for end-to-end pose estimation, with
the flexibility and scalability of arbitrary video sequences as input. We
achieve this by adapting temporal receptive field via a multi-scale structure
of dilated convolutions. Besides, the proposed architecture can be easily
adapted to a causal model enabling real-time performance. Any off-the-shelf 2D
pose estimation system, e.g. Mocap libraries, can be easily integrated in an
ad-hoc fashion. Our method achieves state-of-the-art performance and
outperforms existing methods by reducing the mean per joint position error to
33.4 mm on the Human3.6M dataset.
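The abstract relies on two quantities that are easy to state in code: the temporal receptive field of a stack of dilated convolutions, and the mean per joint position error (MPJPE). The sketch below is a minimal illustration, not the authors' implementation. The function names, the kernel sizes and the dilation rates are assumptions chosen for the example, since the abstract does not give the paper's configuration. It assumes stride-1 1D convolutions over the time axis and 3D joint positions given in millimetres.

```python
import math
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]


def receptive_field(kernel_sizes: Sequence[int], dilations: Sequence[int]) -> int:
    """Number of input frames seen by one output frame of a stack of
    stride-1 dilated 1D convolutions (one kernel size and dilation per layer)."""
    if len(kernel_sizes) != len(dilations):
        raise ValueError("kernel_sizes and dilations must have the same length")
    # Each layer widens the field by (kernel_size - 1) * dilation frames.
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))


def context_split(rf: int, causal: bool) -> Tuple[int, int]:
    """Return (past_frames, future_frames) used around the current frame.
    A causal model looks only at past frames, so it can run on a live stream."""
    if causal:
        return rf - 1, 0
    half = (rf - 1) // 2
    return half, rf - 1 - half


def mpjpe(pred: Sequence[Sequence[Joint]], gt: Sequence[Sequence[Joint]]) -> float:
    """Mean per joint position error: the Euclidean distance between predicted
    and ground-truth joints, averaged over all frames and joints."""
    errors = [
        math.dist(p, g)
        for pred_frame, gt_frame in zip(pred, gt)
        for p, g in zip(pred_frame, gt_frame)
    ]
    if not errors:
        raise ValueError("no joints to evaluate")
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical configuration: four layers, kernel size 3, dilations 1, 3, 9, 27.
    rf = receptive_field([3, 3, 3, 3], [1, 3, 9, 27])
    print(rf)                                # 81
    print(context_split(rf, causal=False))   # (40, 40)
    print(context_split(rf, causal=True))    # (80, 0)
```

With these illustrative settings the receptive field grows geometrically with depth, which is why a few dilated layers are enough to cover long video sequences. Switching to the causal split keeps the same field size but uses no future frames.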
Related papers
- STGFormer: Spatio-Temporal GraphFormer for 3D Human Pose Estimation in Video [7.345621536750547]
This paper presents a graph-based framework for 3D human pose estimation in video.
Specifically, we develop a graph-based attention mechanism, integrating graph information directly into the respective attention layers.
We demonstrate that our method achieves significant state-of-the-art performance in 3D human pose estimation.
arXiv Detail & Related papers (2024-07-14T06:45:27Z)
- UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
- Spatio-temporal MLP-graph network for 3D human pose estimation [8.267311047244881]
Graph convolutional networks and their variants have shown significant promise in 3D human pose estimation.
We introduce a new weighted Jacobi feature rule obtained through graph filtering with implicit propagation fairing.
We also employ adjacency modulation with the aim of learning meaningful correlations beyond those defined between body joints.
arXiv Detail & Related papers (2023-08-29T14:00:55Z)
- DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose
Estimation [16.32910684198013]
We present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem.
We show two unique characteristics of DiffPose on the pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model.
arXiv Detail & Related papers (2023-07-31T14:00:23Z)
- Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
- 3D Convolutional with Attention for Action Recognition [6.238518976312625]
Current action recognition methods use computationally expensive models for learning spatio-temporal dependencies of the action.
This paper proposes a deep neural network architecture for learning such dependencies, consisting of a 3D convolutional layer, fully connected layers and an attention layer.
The method first learns spatial and temporal features of actions through a 3D-CNN, and then the temporal attention mechanism helps the model focus on essential features.
arXiv Detail & Related papers (2022-06-05T15:12:57Z)
- Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose
Estimation [70.32536356351706]
We introduce MRP-Net that constitutes a common deep network backbone with two output heads subscribing to two diverse configurations.
We derive suitable measures to quantify prediction uncertainty at both pose and joint level.
We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2022-03-29T07:14:58Z)
- Stereo Neural Vernier Caliper [57.187088191829886]
We propose a new object-centric framework for learning-based stereo 3D object detection.
We tackle a problem of how to predict a refined update given an initial 3D cuboid guess.
Our approach achieves state-of-the-art performance on the KITTI benchmark.
arXiv Detail & Related papers (2022-03-21T14:36:07Z)
- Self-Supervised Multi-Frame Monocular Scene Flow [61.588808225321735]
We introduce a multi-frame monocular scene flow network based on self-supervised learning.
We observe state-of-the-art accuracy among monocular scene flow methods based on self-supervised learning.
arXiv Detail & Related papers (2021-05-05T17:49:55Z)
- Kinematic-Structure-Preserved Representation for Unsupervised 3D Human
Pose Estimation [58.72192168935338]
Generalizability of human pose estimation models developed using supervision on large-scale in-studio datasets remains questionable.
We propose a novel kinematic-structure-preserved unsupervised 3D pose estimation framework, which is not restrained by any paired or unpaired weak supervisions.
Our proposed model employs three consecutive differentiable transformations named as forward-kinematics, camera-projection and spatial-map transformation.
arXiv Detail & Related papers (2020-06-24T23:56:33Z)
- A Graph Attention Spatio-temporal Convolutional Network for 3D Human
Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in the human skeleton by modeling local and global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.