View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose
- URL: http://arxiv.org/abs/2010.13321v3
- Date: Thu, 18 Nov 2021 10:03:27 GMT
- Title: View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose
- Authors: Ting Liu, Jennifer J. Sun, Long Zhao, Jiaping Zhao, Liangzhe Yuan,
Yuxiao Wang, Liang-Chieh Chen, Florian Schroff, Hartwig Adam
- Abstract summary: We propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses.
Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views.
- Score: 36.384824115033304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognition of human poses and actions is crucial for autonomous systems to
interact smoothly with people. However, cameras generally capture human poses
in 2D as images and videos, which can have significant appearance variations
across viewpoints that make the recognition tasks challenging. To address this,
we explore recognizing similarity in 3D human body poses from 2D information,
which has not been well-studied in existing works. Here, we propose an approach
to learning a compact view-invariant embedding space from 2D body joint
keypoints, without explicitly predicting 3D poses. Input ambiguities of 2D
poses from projection and occlusion are difficult to represent through a
deterministic mapping, and therefore we adopt a probabilistic formulation for
our embedding space. Experimental results show that our embedding model
achieves higher accuracy when retrieving similar poses across different camera
views, in comparison with 3D pose estimation models. We also show that by
training a simple temporal embedding model, we achieve superior performance on
pose sequence retrieval while substantially reducing the embedding
dimensionality relative to stacking frame-based embeddings, enabling efficient
large-scale retrieval. Furthermore, to enable our embeddings to work with
partially visible input, we investigate different keypoint occlusion
augmentation strategies during training. We demonstrate that these occlusion
augmentations
significantly improve retrieval performance on partial 2D input poses. Results
on action recognition and video alignment demonstrate that using our embeddings
without any additional training achieves competitive performance relative to
other models specifically trained for each task.
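To make the probabilistic formulation concrete, below is a minimal sketch of a probabilistic pose embedder in PyTorch: instead of a single point, each 2D pose maps to a Gaussian (a mean plus per-dimension variance), and retrieval compares sampled embeddings. All names, layer sizes, and the distance-based match score are illustrative assumptions, not the paper's actual architecture or losses.

```python
import torch
import torch.nn as nn


class ProbabilisticPoseEmbedder(nn.Module):
    """Map flattened 2D keypoints to a Gaussian embedding (mean, variance).

    Hypothetical sketch: the paper's actual backbone, normalization, and
    training losses are not reproduced here.
    """

    def __init__(self, num_keypoints=13, embed_dim=16, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(num_keypoints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, embed_dim)
        # Predict log-variance so the variance is always positive.
        self.logvar_head = nn.Linear(hidden, embed_dim)

    def forward(self, keypoints_2d):
        # keypoints_2d: (batch, num_keypoints, 2)
        h = self.backbone(keypoints_2d.flatten(start_dim=1))
        return self.mean_head(h), self.logvar_head(h).exp()


def sampled_match_score(mean_a, var_a, mean_b, var_b, n_samples=20):
    """Monte Carlo matching score between two Gaussian embeddings:
    sample from each distribution and average a distance-based score.
    The sigmoid-of-negative-distance score is illustrative only."""
    z_a = mean_a + torch.randn(n_samples, *mean_a.shape) * var_a.sqrt()
    z_b = mean_b + torch.randn(n_samples, *mean_b.shape) * var_b.sqrt()
    d2 = ((z_a - z_b) ** 2).sum(dim=-1)    # (n_samples, batch)
    return torch.sigmoid(-d2).mean(dim=0)  # (batch,)
```

Ambiguous inputs (heavily occluded or foreshortened poses) would then be expected to receive larger predicted variances, widening their match distributions rather than forcing a single embedding point.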
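The occlusion robustness described above comes from keypoint occlusion augmentation during training. Here is a hedged sketch of one plausible strategy, dropping joints individually or in groups (e.g., a whole arm) and returning a visibility mask; the specific strategies the paper compares may differ.

```python
import torch


def occlusion_augment(keypoints_2d, drop_prob=0.2, joint_groups=None):
    """Randomly mask 2D keypoints to simulate occlusion during training.

    keypoints_2d: (batch, num_joints, 2) tensor of joint coordinates.
    joint_groups: optional list of joint-index lists (e.g. one arm, one
        leg) dropped together to mimic part-level occlusion.
    Returns the masked keypoints and a (batch, num_joints) visibility
    mask. Illustrative only: the paper compares several strategies.
    """
    batch, num_joints, _ = keypoints_2d.shape
    visible = torch.ones(batch, num_joints)
    if joint_groups is None:
        joint_groups = [[j] for j in range(num_joints)]  # per-joint drops
    for group in joint_groups:
        drop = torch.rand(batch) < drop_prob  # samples losing this group
        for j in group:
            visible[drop, j] = 0.0
    # Zero out occluded joints; the mask can also be fed to the model.
    return keypoints_2d * visible.unsqueeze(-1), visible
```

The embedder would then be trained on the masked keypoints while similarity labels are still derived from the unoccluded poses.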
Related papers
- Two Views Are Better than One: Monocular 3D Pose Estimation with Multiview Consistency [0.493599216374976]
We propose a novel loss function, multiview consistency, to enable adding additional training data with only 2D supervision.
Our experiments demonstrate that two views offset by 90 degrees are enough to obtain good performance, with only marginal improvements by adding more views.
This research introduces new possibilities for domain adaptation in 3D pose estimation, providing a practical and cost-effective solution to customize models for specific applications. (A minimal sketch of such a consistency loss appears after this list.)
arXiv Detail & Related papers (2023-11-21T08:21:55Z)
- CameraPose: Weakly-Supervised Monocular 3D Human Pose Estimation by Leveraging In-the-wild 2D Annotations [25.05308239278207]
We present CameraPose, a weakly-supervised framework for 3D human pose estimation from a single image.
By adding a camera parameter branch, any in-the-wild 2D annotations can be fed into our pipeline to boost the training diversity.
We also introduce a refinement network module with confidence-guided loss to further improve the quality of noisy 2D keypoints extracted by 2D pose estimators.
arXiv Detail & Related papers (2023-01-08T05:07:41Z)
- Optimising 2D Pose Representation: Improve Accuracy, Stability and Generalisability Within Unsupervised 2D-3D Human Pose Estimation [7.294965109944706]
Our results show that the optimal representation of a 2D pose is two independent segments, the torso and the legs, with no features shared between the two lifting networks.
arXiv Detail & Related papers (2022-09-01T17:32:52Z)
- RiCS: A 2D Self-Occlusion Map for Harmonizing Volumetric Objects [68.85305626324694]
Ray-marching in Camera Space (RiCS) is a new method that represents the self-occlusions of foreground 3D objects as a 2D self-occlusion map.
We show that our representation map not only allows us to enhance the image quality but also to model temporally coherent complex shadow effects.
arXiv Detail & Related papers (2022-05-14T05:35:35Z)
- PONet: Robust 3D Human Pose Estimation via Learning Orientations Only [116.1502793612437]
We propose a novel Pose Orientation Net (PONet) that robustly estimates 3D pose by learning limb orientations only.
PONet estimates the 3D orientation of the body limbs from local image evidence and uses these orientations to recover the 3D pose.
We evaluate our method on multiple datasets, including Human3.6M, MPII, MPI-INF-3DHP, and 3DPW.
arXiv Detail & Related papers (2021-12-21T12:48:48Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate 3D mesh of multiple body parts with large-scale differences from a single RGB image.
The main challenge is the lack of training data with complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
- Multi-Scale Networks for 3D Human Pose Estimation with Inference Stage Optimization [33.02708860641971]
Estimating 3D human poses from a monocular video is still a challenging task.
Many existing methods fail when the target person is occluded by other objects, or when the motion is too fast or too slow relative to the scale and speed of the training data.
We introduce a spatio-temporal network for robust 3D human pose estimation.
arXiv Detail & Related papers (2020-10-13T15:24:28Z)
- Unsupervised Cross-Modal Alignment for Multi-Person 3D Pose Estimation [52.94078950641959]
We present a deployment-friendly, fast bottom-up framework for multi-person 3D human pose estimation.
We adopt a novel neural representation of multi-person 3D pose which unifies the position of person instances with their corresponding 3D pose representation.
We propose a practical deployment paradigm where paired 2D or 3D pose annotations are unavailable.
arXiv Detail & Related papers (2020-08-04T07:54:25Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
- Weakly-Supervised 3D Human Pose Learning via Multi-view Images in the Wild [101.70320427145388]
We propose a weakly-supervised approach that does not require 3D annotations and learns to estimate 3D poses from unlabeled multi-view data.
We evaluate our proposed approach on two large-scale datasets.
arXiv Detail & Related papers (2020-03-17T08:47:16Z)
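The first related paper above ("Two Views Are Better than One") trains with a multiview consistency loss: 3D poses lifted independently from two synchronized views of the same instant should agree up to the relative camera rotation. Below is a minimal sketch under the assumption that this rotation is known; the cited paper's exact formulation is not reproduced here.

```python
import torch


def multiview_consistency_loss(pose3d_a, pose3d_b, rot_a_to_b):
    """Penalize disagreement between 3D poses lifted independently from
    two synchronized camera views.

    pose3d_a, pose3d_b: (batch, num_joints, 3) root-relative poses.
    rot_a_to_b: (3, 3) relative camera rotation, assumed known here.
    Illustrative sketch; the cited paper's loss may differ in detail.
    """
    aligned = pose3d_a @ rot_a_to_b.T  # rotate view-a poses into view b
    return ((aligned - pose3d_b) ** 2).mean()
```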