Related papers: Enhanced 3D Human Pose Estimation from Videos by using Attention-Based Neural Network with Dilated Convolutions

Enhanced 3D Human Pose Estimation from Videos by using Attention-Based Neural Network with Dilated Convolutions

URL: http://arxiv.org/abs/2103.03170v1
Date: Thu, 4 Mar 2021 17:26:51 GMT
Title: Enhanced 3D Human Pose Estimation from Videos by using Attention-Based Neural Network with Dilated Convolutions
Authors: Ruixu Liu, Ju Shen, He Wang, Chen Chen, Sen-ching Cheung, Vijayan K. Asari
Abstract summary: We show a systematic design for how conventional networks and other forms of constraints can be incorporated into the attention framework. We achieve this by adapting temporal receptive field via a multi-scale structure of dilated convolutions. Our method achieves the state-of-the-art performance and outperforms existing methods by reducing the mean per joint position error to 33.4 mm on Human3.6M dataset.
Score: 12.900524511984798
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The attention mechanism provides a sequential prediction framework for learning spatial models with enhanced implicit temporal consistency. In this work, we show a systematic design (from 2D to 3D) for how conventional networks and other forms of constraints can be incorporated into the attention framework for learning long-range dependencies for the task of pose estimation. The contribution of this paper is to provide a systematic approach for designing and training of attention-based models for the end-to-end pose estimation, with the flexibility and scalability of arbitrary video sequences as input. We achieve this by adapting temporal receptive field via a multi-scale structure of dilated convolutions. Besides, the proposed architecture can be easily adapted to a causal model enabling real-time performance. Any off-the-shelf 2D pose estimation systems, e.g. Mocap libraries, can be easily integrated in an ad-hoc fashion. Our method achieves the state-of-the-art performance and outperforms existing methods by reducing the mean per joint position error to 33.4 mm on Human3.6M dataset.

Related papers

Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding [80.66591664266744]
Lemon is a unified transformer architecture that processes 3D point cloud patches and language tokens as a single sequence.<n>To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context.<n>Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks.
arXiv Detail & Related papers (2025-12-14T20:02:43Z)
Adaptive graph Kolmogorov-Arnold network for 3D human pose estimation [3.3946853660795884]
Graph convolutional network (GCN)-based methods have shown strong performance in 3D human pose estimation.<n>We introduce PoseKAN, a framework that extends KANs to graph-based learning for 2D-to-3D pose lifting from a single image.<n>Our model employs multi-hop feature aggregation, ensuring the body joints can leverage information from both local and distant neighbors.
arXiv Detail & Related papers (2025-11-11T22:23:24Z)
PRGCN: A Graph Memory Network for Cross-Sequence Pattern Reuse in 3D Human Pose Estimation [18.771349697842947]
This work introduces the Pattern Reuse Graph Conal Network (PRGCN), a novel framework that formalizes pose estimation as a problem of pattern retrieval and adaptation.<n>At its core, PRGCN features a graph memory bank that learns and stores a compact set of pose prototypes, encoded as relational graphs, which are dynamically retrieved via an attention mechanism to provide structured priors.<n>Our work posits that PRGCN establishes a new state-of-the-art, achieving an MPJPE of 37.1mm and 13.4mm, respectively, while exhibiting enhanced cross-domain generalization capability.
arXiv Detail & Related papers (2025-10-22T11:12:07Z)
ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks. We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation. Our purelytemporalal architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z)
STGFormer: Spatio-Temporal GraphFormer for 3D Human Pose Estimation in Video [7.345621536750547]
This paper presents a graph-based framework for 3D human pose estimation in video. Specifically, we develop a graph-based attention mechanism, integrating graph information directly into the respective attention layers. We demonstrate that our method achieves significant stateof-the-art performance in 3D human pose estimation.
arXiv Detail & Related papers (2024-07-14T06:45:27Z)
UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation. It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
Spatio-temporal MLP-graph network for 3D human pose estimation [8.267311047244881]
Graph convolutional networks and their variants have shown significant promise in 3D human pose estimation. We introduce a new weighted Jacobi feature rule obtained through graph filtering with implicit propagation fairing. We also employ adjacency modulation with the aim of learning meaningful correlations beyond defined between body joints.
arXiv Detail & Related papers (2023-08-29T14:00:55Z)
DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation [16.32910684198013]
We present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem. We show two unique characteristics from DiffPose on pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model.
arXiv Detail & Related papers (2023-07-31T14:00:23Z)
3D Convolutional with Attention for Action Recognition [6.238518976312625]
Current action recognition methods use computationally expensive models for learning-temporal dependencies of the action. This paper proposes a deep neural network architecture for learning such dependencies consisting of a 3D convolutional layer, fully connected layers and attention layer. The method first learns spatial features and temporal of actions through 3D-CNN, and then the attention temporal mechanism helps the model to locate attention to essential features.
arXiv Detail & Related papers (2022-06-05T15:12:57Z)
Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation [70.32536356351706]
We introduce MRP-Net that constitutes a common deep network backbone with two output heads subscribing to two diverse configurations. We derive suitable measures to quantify prediction uncertainty at both pose and joint level. We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2022-03-29T07:14:58Z)
Stereo Neural Vernier Caliper [57.187088191829886]
We propose a new object-centric framework for learning-based stereo 3D object detection. We tackle a problem of how to predict a refined update given an initial 3D cuboid guess. Our approach achieves state-of-the-art performance on the KITTI benchmark.
arXiv Detail & Related papers (2022-03-21T14:36:07Z)
Self-Supervised Multi-Frame Monocular Scene Flow [61.588808225321735]
We introduce a multi-frame monocular scene flow network based on self-supervised learning. We observe state-of-the-art accuracy among monocular scene flow methods based on self-supervised learning.
arXiv Detail & Related papers (2021-05-05T17:49:55Z)
Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation [58.72192168935338]
Generalizability of human pose estimation models developed using supervision on large-scale in-studio datasets remains questionable. We propose a novel kinematic-structure-preserved unsupervised 3D pose estimation framework, which is not restrained by any paired or unpaired weak supervisions. Our proposed model employs three consecutive differentiable transformations named as forward-kinematics, camera-projection and spatial-map transformation.
arXiv Detail & Related papers (2020-06-24T23:56:33Z)
A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in human skeleton by modeling local global spatial information via attention mechanisms. Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.