UniHPE: Towards Unified Human Pose Estimation via Contrastive Learning
- URL: http://arxiv.org/abs/2311.16477v1
- Date: Fri, 24 Nov 2023 21:55:34 GMT
- Title: UniHPE: Towards Unified Human Pose Estimation via Contrastive Learning
- Authors: Zhongyu Jiang, Wenhao Chai, Lei Li, Zhuoran Zhou, Cheng-Yen Yang,
Jenq-Neng Hwang
- Abstract summary: 2D and 3D Human Pose Estimation (HPE) are two critical perceptual tasks in computer vision.
We propose UniHPE, a unified Human Pose Estimation pipeline, which aligns features from all three modalities.
Our proposed method holds immense potential to advance the field of computer vision and contribute to various applications.
- Score: 29.037799937729687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent times, there has been a growing interest in developing effective
perception techniques for combining information from multiple modalities. This
involves aligning features obtained from diverse sources to enable more
efficient training with larger datasets and constraints, as well as leveraging
the wealth of information contained in each modality. 2D and 3D Human Pose
Estimation (HPE) are two critical perceptual tasks in computer vision, which
have numerous downstream applications, such as Action Recognition,
Human-Computer Interaction, Object tracking, etc. Yet, there are limited
instances where the correlation between Image and 2D/3D human pose has been
clearly researched using a contrastive paradigm. In this paper, we propose
UniHPE, a unified Human Pose Estimation pipeline, which aligns features from
all three modalities, i.e., 2D human pose estimation, lifting-based and
image-based 3D human pose estimation, in the same pipeline. To align more than
two modalities at the same time, we propose a novel singular value based
contrastive learning loss, which better aligns different modalities and further
boosts the performance. In our evaluation, UniHPE achieves remarkable
performance metrics: MPJPE $50.5$mm on the Human3.6M dataset and PAMPJPE
$51.6$mm on the 3DPW dataset. Our proposed method holds immense potential to
advance the field of computer vision and contribute to various applications.
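The abstract reports numbers but not the form of the singular-value-based contrastive loss. Purely as a hedged illustration, one way a trimodal alignment objective of this flavor could look in PyTorch is sketched below; the function names, the rank-1 reading of "alignment", and the pairwise InfoNCE terms are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, t=0.07):
    # CLIP-style symmetric InfoNCE between two batches of embeddings.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / t
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

def singular_value_alignment(feats):
    # feats: list of (B, D) embeddings, one per modality (image, 2D pose,
    # 3D pose). Stack each sample's matched embeddings into an (M, D)
    # matrix; if all modalities agree, that matrix is rank-1, so we push
    # the top singular value to dominate the spectrum.
    x = torch.stack([F.normalize(f, dim=-1) for f in feats], dim=1)  # (B, M, D)
    s = torch.linalg.svdvals(x)  # (B, M), singular values in descending order
    return (1.0 - s[:, 0] / s.sum(dim=-1).clamp_min(1e-8)).mean()

def trimodal_loss(img_f, pose2d_f, pose3d_f, lam=1.0):
    # Pairwise contrastive terms plus the singular-value alignment term.
    pairs = (info_nce(img_f, pose2d_f) + info_nce(img_f, pose3d_f)
             + info_nce(pose2d_f, pose3d_f))
    return pairs + lam * singular_value_alignment([img_f, pose2d_f, pose3d_f])
```

In this sketch the alignment term reaches zero exactly when the three embeddings of each sample are parallel, which is one plausible formalization of "aligning more than two modalities at the same time."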
Related papers
- Interpretable 2D Vision Models for 3D Medical Images [47.75089895500738]
This study proposes a simple approach of adapting 2D networks with an intermediate feature representation for processing 3D images.
On all 3D MedMNIST benchmark datasets, and on two real-world datasets comprising several hundred high-resolution CT or MRI scans, our approach performs on par with existing methods.
arXiv Detail & Related papers (2023-07-13T08:27:09Z)
- Back to Optimization: Diffusion-based Zero-Shot 3D Human Pose Estimation [29.037799937729687]
Learning-based methods have dominated 3D human pose estimation (HPE), achieving significantly better performance on most benchmarks than traditional optimization-based methods.
We propose a Zero-shot Diffusion-based Optimization (ZeDO) pipeline for 3D HPE.
Our multi-hypothesis ZeDO achieves state-of-the-art (SOTA) performance on Human3.6M, with minMPJPE $51.4$mm.
arXiv Detail & Related papers (2023-07-07T21:03:18Z)
- AdaptivePose++: A Powerful Single-Stage Network for Multi-Person Pose Regression [66.39539141222524]
We propose to represent the human parts as adaptive points and introduce a fine-grained body representation method.
With the proposed body representation, we deliver a compact single-stage multi-person pose regression network, termed AdaptivePose.
We employ AdaptivePose on both 2D and 3D multi-person pose estimation tasks to verify its effectiveness.
arXiv Detail & Related papers (2022-10-08T12:54:20Z)
- Higher-Order Implicit Fairing Networks for 3D Human Pose Estimation [1.1501261942096426]
We introduce a higher-order graph convolutional framework with initial residual connections for 2D-to-3D pose estimation.
Our model is able to capture the long-range dependencies between body joints.
Experiments and ablation studies conducted on two standard benchmarks demonstrate the effectiveness of our model.
arXiv Detail & Related papers (2021-11-01T13:48:55Z)
- Graph-Based 3D Multi-Person Pose Estimation Using Multi-View Images [79.70127290464514]
We decompose the task into two stages, i.e., person localization and pose estimation.
We propose three task-specific graph neural networks for effective message passing.
Our approach achieves state-of-the-art performance on CMU Panoptic and Shelf datasets.
arXiv Detail & Related papers (2021-09-13T11:44:07Z)
- View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose [36.384824115033304]
We propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses.
Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views.
arXiv Detail & Related papers (2020-10-23T17:58:35Z)
- Multi-Scale Networks for 3D Human Pose Estimation with Inference Stage Optimization [33.02708860641971]
Estimating 3D human poses from a monocular video is still a challenging task.
The performance of many existing methods drops when the target person is occluded by other objects, or when the motion is too fast or too slow relative to the scale and speed of the training data.
We introduce a spatio-temporal network for robust 3D human pose estimation.
arXiv Detail & Related papers (2020-10-13T15:24:28Z)
- HMOR: Hierarchical Multi-Person Ordinal Relations for Monocular Multi-Person 3D Pose Estimation [54.23770284299979]
This paper introduces a novel form of supervision: Hierarchical Multi-person Ordinal Relations (HMOR).
HMOR encodes interaction information as the ordinal relations of depths and angles hierarchically.
An integrated top-down model is designed to leverage these ordinal relations in the learning process.
The proposed method significantly outperforms state-of-the-art methods on publicly available multi-person 3D pose datasets.
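HMOR's full hierarchy and its angle relations cannot be reconstructed from this summary; as an assumption-level sketch only, the depth-ordering component alone might be supervised with a pairwise ranking loss like the following (hypothetical names, PyTorch):

```python
import torch

def ordinal_depth_loss(pred_depth, gt_depth, margin=0.05):
    # Pairwise ordinal supervision on per-person root depths (N,): penalize
    # predicted pairs whose depth ordering contradicts the ground truth.
    diff_pred = pred_depth.unsqueeze(1) - pred_depth.unsqueeze(0)        # (N, N)
    sign_gt = torch.sign(gt_depth.unsqueeze(1) - gt_depth.unsqueeze(0))  # (N, N)
    # Ties (sign 0, including the diagonal) contribute only a constant offset.
    return torch.relu(margin - sign_gt * diff_pred).mean()
```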
arXiv Detail & Related papers (2020-08-01T07:53:27Z)
- Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-View Geometry [62.29762409558553]
Epipolar constraints are at the core of feature matching and depth estimation in multi-person 3D human pose estimation methods.
While this formulation performs satisfactorily in sparser crowd scenes, its effectiveness is frequently challenged in denser crowds.
In this paper, we depart from the multi-person 3D pose estimation formulation, and instead reformulate it as crowd pose estimation.
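The epipolar machinery referred to here is standard multi-view geometry rather than anything specific to this paper; a minimal NumPy sketch of the symmetric epipolar distance commonly used to associate 2D detections across views:

```python
import numpy as np

def epipolar_distance(x1, x2, F):
    # Symmetric point-to-epipolar-line distance for matched 2D joints.
    # x1, x2: (N, 2) joints in two views; F: (3, 3) fundamental matrix
    # satisfying x2h^T @ F @ x1h = 0 for true correspondences.
    x1h = np.hstack([x1, np.ones((len(x1), 1))])
    x2h = np.hstack([x2, np.ones((len(x2), 1))])
    l2 = x1h @ F.T  # epipolar lines of view-1 points, expressed in view 2
    l1 = x2h @ F    # epipolar lines of view-2 points, expressed in view 1
    d2 = np.abs(np.sum(x2h * l2, axis=1)) / np.linalg.norm(l2[:, :2], axis=1)
    d1 = np.abs(np.sum(x1h * l1, axis=1)) / np.linalg.norm(l1[:, :2], axis=1)
    return 0.5 * (d1 + d2)  # small values suggest the detections match
```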
arXiv Detail & Related papers (2020-07-21T17:59:36Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-03-17T08:47:16Z)
- Weakly-Supervised 3D Human Pose Learning via Multi-view Images in the Wild [101.70320427145388]
We propose a weakly-supervised approach that does not require 3D annotations and learns to estimate 3D poses from unlabeled multi-view data.
We evaluate our proposed approach on two large-scale datasets.
arXiv Detail & Related papers (2020-03-17T08:47:16Z)
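For the weakly-supervised multi-view entry above, the summary does not state the training objective; purely as an illustration of how unlabeled multi-view data can supervise 3D pose, a cross-view consistency term after rigid (Kabsch) alignment might be sketched as follows, with all names and the choice of alignment being assumptions:

```python
import torch

def kabsch_align(src, dst):
    # Rotate and translate src (J, 3) onto dst (J, 3) via the Kabsch method.
    mu_s, mu_d = src.mean(0), dst.mean(0)
    s, d = src - mu_s, dst - mu_d
    u, _, vt = torch.linalg.svd(s.t() @ d)
    sign = torch.sign(torch.det(vt.t() @ u.t()))  # guard against reflections
    fix = torch.diag(torch.stack([torch.ones_like(sign),
                                  torch.ones_like(sign), sign]))
    r = vt.t() @ fix @ u.t()
    return (r @ s.t()).t() + mu_d

def multiview_consistency(poses):
    # poses: list of (J, 3) per-view 3D predictions of the same person.
    # Penalize disagreement after removing each view's rigid transform.
    ref = poses[0]
    errs = [torch.norm(kabsch_align(p, ref) - ref, dim=-1).mean()
            for p in poses[1:]]
    return torch.stack(errs).mean()
```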