UniHPR: Unified Human Pose Representation via Singular Value Contrastive Learning
- URL: http://arxiv.org/abs/2510.19078v1
- Date: Tue, 21 Oct 2025 21:06:51 GMT
- Authors: Zhongyu Jiang, Wenhao Chai, Lei Li, Zhuoran Zhou, Cheng-Yen Yang, Jenq-Neng Hwang
- Abstract summary: We propose UniHPR, a unified Human Pose Representation learning pipeline, which aligns Human Pose embeddings from images, 2D and 3D human poses. In our evaluation, with a simple 3D human pose decoder, UniHPR achieves remarkable performance metrics.
- Score: 45.892775193282546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, there has been growing interest in developing effective alignment pipelines that generate unified representations from different modalities for multi-modal fusion and generation. As an important component of human-centric applications, Human Pose representations are critical in many downstream tasks, such as Human Pose Estimation, Action Recognition, Human-Computer Interaction, and Object Tracking. Human Pose representations or embeddings can be extracted from images, 2D keypoints, 3D skeletons, mesh models, and many other modalities. Yet the correlation among all of those representations has rarely been studied using a contrastive paradigm. In this paper, we propose UniHPR, a unified Human Pose Representation learning pipeline, which aligns Human Pose embeddings from images, 2D and 3D human poses. To align more than two data representations at the same time, we propose a novel singular value-based contrastive learning loss, which better aligns different modalities and further boosts performance. To evaluate the effectiveness of the aligned representation, we choose 2D and 3D Human Pose Estimation (HPE) as our evaluation tasks. In our evaluation, with a simple 3D human pose decoder, UniHPR achieves an MPJPE of 49.9mm on the Human3.6M dataset and a PA-MPJPE of 51.6mm on the 3DPW dataset with cross-domain evaluation. Meanwhile, we are able to achieve 2D and 3D pose retrieval with our unified human pose representations on the Human3.6M dataset, where the retrieval error is 9.24mm in MPJPE.
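The abstract reports results in MPJPE and PA-MPJPE. As a rough illustration of what those two metrics measure, here is a minimal NumPy sketch. It assumes poses are arrays of shape (J, 3) in millimetres; the function names, the 17-joint example, and the absence of root-joint alignment are assumptions made for illustration, and the paper's exact evaluation protocol is not given in this listing.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Similarity-align pred (J, 3) to gt (J, 3): rotation, uniform scale, translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the 3x3 cross-covariance between the centred point sets.
    U, S, Vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    return scale * p @ R.T + mu_g

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: MPJPE after similarity alignment of pred to gt."""
    return mpjpe(procrustes_align(pred, gt), gt)

# Hypothetical 17-joint skeleton, used only to exercise the functions.
rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3))
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
pred = 2.5 * gt @ rot_z.T + np.array([1.0, -2.0, 0.5])
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```

Because PA-MPJPE removes global rotation, scale, and translation before measuring error, it isolates errors in the articulated pose itself, which is one reason it is often reported for cross-domain evaluation.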
Related papers
- Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild [29.18347483848261]
Single-view 3D human reconstruction has achieved remarkable progress, yet the recovered 3D humans often exhibit unnatural poses. We introduce DrPose, a Direct Reward fine-tuning algorithm on Poses, which enables post-training of a multi-view diffusion model on diverse poses. DrPose trains a model using only human poses paired with single-view images, employing direct reward fine-tuning to maximize PoseScore.
arXiv Detail & Related papers (2026-03-03T05:47:18Z)
- StackFLOW: Monocular Human-Object Reconstruction by Stacked Normalizing Flow with Offset [56.71580976007712]
We propose to use the Human-Object Offset between anchors which are densely sampled from the surface of human mesh and object mesh to represent human-object spatial relation.
Based on this representation, we propose Stacked Normalizing Flow (StackFLOW) to infer the posterior distribution of human-object spatial relations from the image.
During the optimization stage, we finetune the human body pose and object 6D pose by maximizing the likelihood of samples.
arXiv Detail & Related papers (2024-07-30T04:57:21Z)
- Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers [28.38686299271394]
We propose a framework for 3D sequence-to-sequence (seq2seq) human pose detection.
Firstly, the spatial module represents the human pose feature by intra-image content, while the frame-image relation module extracts temporal relationships.
Our method is evaluated on Human3.6M, a popular 3D human pose detection dataset.
arXiv Detail & Related papers (2024-01-30T03:00:25Z)
- UniHPE: Towards Unified Human Pose Estimation via Contrastive Learning [29.037799937729687]
2D and 3D Human Pose Estimation (HPE) are two critical perceptual tasks in computer vision.
We propose UniHPE, a unified Human Pose Estimation pipeline, which aligns features from all three modalities.
Our proposed method holds immense potential to advance the field of computer vision and contribute to various applications.
arXiv Detail & Related papers (2023-11-24T21:55:34Z)
- AdaptivePose++: A Powerful Single-Stage Network for Multi-Person Pose Regression [66.39539141222524]
We propose to represent the human parts as adaptive points and introduce a fine-grained body representation method.
With the proposed body representation, we deliver a compact single-stage multi-person pose regression network, termed AdaptivePose.
We employ AdaptivePose for both 2D/3D multi-person pose estimation tasks to verify the effectiveness of AdaptivePose.
arXiv Detail & Related papers (2022-10-08T12:54:20Z)
- Adapted Human Pose: Monocular 3D Human Pose Estimation with Zero Real 3D Pose Data [14.719976311208502]
Training vs. test data domain gaps often negatively affect model performance.
We present our adapted human pose (AHuP) approach that addresses adaptation problems in both appearance and pose spaces.
AHuP is built around the practical assumption that in real applications, data from the target domain may be inaccessible, or only limited information about it can be acquired.
arXiv Detail & Related papers (2021-05-23T01:20:40Z)
- Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-View Geometry [62.29762409558553]
Epipolar constraints are at the core of feature matching and depth estimation in multi-person 3D human pose estimation methods.
Despite the satisfactory performance of this formulation in sparser crowd scenes, its effectiveness is frequently challenged under denser crowd circumstances.
In this paper, we depart from the multi-person 3D pose estimation formulation, and instead reformulate it as crowd pose estimation.
arXiv Detail & Related papers (2020-07-21T17:59:36Z)
- Unsupervised 3D Human Pose Representation with Viewpoint and Pose Disentanglement [63.853412753242615]
Learning a good 3D human pose representation is important for human pose related tasks.
We propose a novel Siamese denoising autoencoder to learn a 3D pose representation.
Our approach achieves state-of-the-art performance on two inherently different tasks.
arXiv Detail & Related papers (2020-07-14T14:25:22Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.