Related papers: PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation

URL: http://arxiv.org/abs/2508.17239v2
Date: Tue, 26 Aug 2025 05:19:31 GMT
Title: PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation
Authors: Xiaoyang Hao, Han Li,
Abstract summary: We propose a novel 3D human pose estimation (HPE) framework, PersPose. PersPose achieves state-of-the-art (SOTA) performance on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
Score: 8.604338422941712
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Monocular 3D human pose estimation (HPE) methods estimate the 3D positions of joints from individual images. Existing 3D HPE approaches often use the cropped image alone as input for their models. However, the relative depths of joints cannot be accurately estimated from cropped images without the corresponding camera intrinsics, which determine the perspective relationship between 3D objects and the cropped images. In this work, we introduce Perspective Encoding (PE) to encode the camera intrinsics of the cropped images. Moreover, since the human subject can appear anywhere within the original image, the perspective relationship between the 3D scene and the cropped image differs significantly, which complicates model fitting. Additionally, the further the human subject deviates from the image center, the greater the perspective distortions in the cropped image. To address these issues, we propose Perspective Rotation (PR), a transformation applied to the original image that centers the human subject, thereby reducing perspective distortions and alleviating the difficulty of model fitting. By incorporating PE and PR, we propose a novel 3D HPE framework, PersPose. Experimental results demonstrate that PersPose achieves state-of-the-art (SOTA) performance on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. For example, on the in-the-wild dataset 3DPW, PersPose achieves an MPJPE of 60.1 mm, 7.54% lower than the previous SOTA approach. Code is available at: https://github.com/KenAdamsJoseph/PersPose.
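A transformation that centers a subject without moving the camera can be illustrated with a pure camera rotation, which induces the image homography H = K R K^{-1}. The sketch below is only an illustration under that assumption and is not taken from the PersPose code; the function names, the example intrinsics matrix K, and the pixel coordinates are hypothetical, and K is assumed to be a standard zero-skew pinhole matrix.

```python
import numpy as np


def rotation_centering_pixel(K, u, v):
    """Return a rotation R that turns the viewing ray of pixel (u, v) onto the optical axis."""
    # Back-project the pixel to a unit-length ray in camera coordinates.
    ray = np.linalg.solve(K, np.array([u, v, 1.0]))
    ray /= np.linalg.norm(ray)
    z_axis = np.array([0.0, 0.0, 1.0])

    # Rotation axis and angle between the ray and the optical axis.
    axis = np.cross(ray, z_axis)
    sin_a = np.linalg.norm(axis)
    cos_a = float(ray @ z_axis)
    if sin_a < 1e-12:  # pixel already sits at the principal point
        return np.eye(3)
    axis /= sin_a

    # Rodrigues' formula: R = I + sin(a) [k]x + (1 - cos(a)) [k]x^2.
    kx = np.array([[0.0, -axis[2], axis[1]],
                   [axis[2], 0.0, -axis[0]],
                   [-axis[1], axis[0], 0.0]])
    return np.eye(3) + sin_a * kx + (1.0 - cos_a) * (kx @ kx)


def centering_homography(K, u, v):
    """Homography H = K R K^-1 induced by rotating the camera toward pixel (u, v)."""
    R = rotation_centering_pixel(K, u, v)
    return K @ R @ np.linalg.inv(K)


# Hypothetical intrinsics: focal length 1000 px, principal point (640, 360).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])

# Hypothetical subject center, far from the image center.
H = centering_homography(K, 1100.0, 200.0)
p = H @ np.array([1100.0, 200.0, 1.0])
centered = p[:2] / p[2]  # dehomogenize
print(centered)
```

With these example values, the subject center maps to the principal point (640, 360) up to floating-point error. Warping the full image with H (for instance with an image-warping routine of your choice) would then give the view a camera at the same position would see if it were pointed at the subject.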

Related papers

PandaPose: 3D Human Pose Lifting from a Single Image via Propagating 2D Pose Prior to 3D Anchor Space [62.10630827126755]
PandaPose is a 3D human pose lifting approach via propagating 2D pose prior to 3D anchor space as the unified intermediate representation. Our 3D anchor space comprises: (1) Joint-wise 3D anchors in the canonical coordinate system, providing accurate and robust priors to mitigate 2D pose estimation inaccuracies.
arXiv Detail & Related papers (2026-02-01T08:20:40Z)
GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers [23.96688843662126]
Reconstructing posed 3D human models from monocular images has important applications in the sports industry. We combine 3D human pose and shape estimation with 3D Gaussian Splatting (3DGS), a representation of the scene composed of a mixture of Gaussians. We show that this combination can achieve near real-time inference of 3D human models from a single image without expensive diffusion models or 3D points supervision.
arXiv Detail & Related papers (2024-09-06T11:34:24Z)
SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views [36.02533658048349]
We propose a novel method, SpaRP, to reconstruct a 3D textured mesh and estimate the relative camera poses for sparse-view images. SpaRP distills knowledge from 2D diffusion models and finetunes them to implicitly deduce the 3D spatial relationships between the sparse views. It requires only about 20 seconds to produce a textured mesh and camera poses for the input views.
arXiv Detail & Related papers (2024-08-19T17:53:10Z)
Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops [17.074716363691294]
Models for predicting 3D from a single image often work with crops around the object of interest and ignore the location of the object in the camera's field of view. We propose Intrinsics-Aware Positional Encoding (KPE), which incorporates information about the location of crops in the image and camera intrinsics. Experiments on three popular 3D-from-a-single-image benchmarks (depth prediction on NYU, 3D object detection on KITTI & nuScenes, and predicting 3D of articulated objects on ARCTIC) show the benefits of KPE.
arXiv Detail & Related papers (2023-12-11T18:28:55Z)
Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video [23.93644678238666]
We propose a Pose and Mesh Co-Evolution network (PMCE) to recover 3D human motion from a video. The proposed PMCE outperforms previous state-of-the-art methods in terms of both per-frame accuracy and temporal consistency.
arXiv Detail & Related papers (2023-08-20T16:03:21Z)
Zolly: Zoom Focal Length Correctly for Perspective-Distorted Human Mesh Reconstruction [66.10717041384625]
Zolly is the first 3DHMR method focusing on perspective-distorted images. We propose a new camera model and a novel 2D representation, termed distortion image, which describes the 2D dense distortion scale of the human body. We extend two real-world datasets tailored for this task, all containing perspective-distorted human images.
arXiv Detail & Related papers (2023-03-24T04:22:41Z)
Depth-based 6DoF Object Pose Estimation using Swin Transformer [1.14219428942199]
Accurately estimating the 6D pose of objects is crucial for many applications, such as robotic grasping, autonomous driving, and augmented reality. We propose a novel framework called SwinDePose, that uses only geometric information from depth images to achieve accurate 6D pose estimation. In experiments on the LineMod and Occlusion LineMod datasets, SwinDePose outperforms existing state-of-the-art methods for 6D object pose estimation using depth images.
arXiv Detail & Related papers (2023-03-03T18:25:07Z)
Shape and Viewpoint without Keypoints [63.26977130704171]
We present a learning framework that learns to recover the 3D shape, pose and texture from a single image. We trained on an image collection without any ground truth 3D shape, multi-view, camera viewpoints or keypoint supervision. We obtain state-of-the-art camera prediction results and show that we can learn to predict diverse shapes and textures across objects.
arXiv Detail & Related papers (2020-07-21T17:58:28Z)
Towards Generalization of 3D Human Pose Estimation In The Wild [73.19542580408971]
3DBodyTex.Pose is a dataset that addresses the task of 3D human pose estimation in-the-wild. 3DBodyTex.Pose offers high quality and rich data containing 405 different real subjects in various clothing and poses, and 81k image samples with ground-truth 2D and 3D pose annotations.
arXiv Detail & Related papers (2020-04-21T13:31:58Z)
Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames. Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
Fusing Wearable IMUs with Multi-View Images for Human Pose Estimation: A Geometric Approach [76.10879433430466]
We propose to estimate 3D human pose from multi-view images and a few IMUs attached to a person's limbs. It operates by first detecting 2D poses from the two signals, and then lifting them to the 3D space. The simple two-step approach reduces the error of the state-of-the-art by a large margin on a public dataset.
arXiv Detail & Related papers (2020-03-25T00:26:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.