Monocular 3D Multi-Person Pose Estimation by Integrating Top-Down and
Bottom-Up Networks
- URL: http://arxiv.org/abs/2104.01797v2
- Date: Wed, 7 Apr 2021 06:22:10 GMT
- Title: Monocular 3D Multi-Person Pose Estimation by Integrating Top-Down and
Bottom-Up Networks
- Authors: Yu Cheng, Bo Wang, Bo Yang, Robby T. Tan
- Abstract summary: In multi-person pose estimation, inter-person occlusion and close interactions can cause human detection to be erroneous and human-joints grouping to be unreliable.
Existing top-down methods rely on human detection and thus suffer from these problems.
We propose the integration of top-down and bottom-up approaches to exploit their strengths.
- Score: 33.974241749058585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In monocular video 3D multi-person pose estimation, inter-person occlusion
and close interactions can cause human detection to be erroneous and
human-joints grouping to be unreliable. Existing top-down methods rely on human
detection and thus suffer from these problems. Existing bottom-up methods do
not use human detection, but they process all persons at once at the same
scale, making them sensitive to multi-person scale variations. To address
these challenges, we propose integrating top-down and bottom-up approaches to
exploit their strengths. Our top-down network estimates human joints for all
persons in an image patch instead of just one, making it robust to possibly
erroneous bounding boxes. Our bottom-up network incorporates
human-detection-based normalized heatmaps, making it more robust to scale
variations. Finally, the estimated 3D poses from the top-down and bottom-up
networks are fed into our integration network to produce the final 3D poses.
Beyond integrating the two networks, we propose a two-person pose
discriminator that enforces natural two-person interactions, unlike existing
pose discriminators that are designed solely for a single person and
consequently cannot assess inter-person interactions. Lastly, we apply a
semi-supervised method to overcome the scarcity of 3D ground-truth data. Our
quantitative and qualitative evaluations show the effectiveness of our method
compared to state-of-the-art baselines.
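As a rough sketch of the pipeline described above, the following code fuses per-joint 3D estimates from a top-down and a bottom-up branch with a small integration network. The module layout, shapes, and residual fusion are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the paper's implementation): an integration
# network that fuses per-joint 3D estimates from a top-down branch
# and a bottom-up branch. Names and shapes are illustrative assumptions.
class IntegrationNet(nn.Module):
    def __init__(self, num_joints: int = 17, hidden: int = 256):
        super().__init__()
        # Input: two candidate 3D poses, each flattened to (J * 3).
        self.mlp = nn.Sequential(
            nn.Linear(num_joints * 3 * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_joints * 3),
        )
        self.num_joints = num_joints

    def forward(self, pose_td: torch.Tensor, pose_bu: torch.Tensor):
        # pose_td, pose_bu: (batch, J, 3) 3D joints from each branch.
        x = torch.cat([pose_td.flatten(1), pose_bu.flatten(1)], dim=1)
        # Predict a residual correction on top of the branch average,
        # so the fused pose defaults to a sane estimate.
        fused = 0.5 * (pose_td + pose_bu) + self.mlp(x).view(
            -1, self.num_joints, 3
        )
        return fused

# Usage: fuse one batch of candidate poses from the two branches.
net = IntegrationNet()
pose_td = torch.randn(8, 17, 3)  # top-down branch output
pose_bu = torch.randn(8, 17, 3)  # bottom-up branch output
final_pose = net(pose_td, pose_bu)  # (8, 17, 3)
```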
Related papers
- AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation [55.179287851188036]
We introduce a novel all-in-one-stage framework, AiOS, for expressive human pose and shape recovery without an additional human detection step.
We first employ a human token to probe a human location in the image and encode global features for each instance.
Then, we introduce a joint-related token to probe the human joint in the image and encode a fine-grained local feature.
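To make the token-probing idea concrete, here is a minimal DETR-style sketch in which learned human and joint query tokens cross-attend to flattened image features; the module names, dimensions, and output heads are assumptions for illustration, not AiOS's actual architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed, not AiOS's actual code): learned "human"
# and "joint" query tokens cross-attend to flattened image features,
# in the spirit of DETR-style one-stage detection.
class TokenProbe(nn.Module):
    def __init__(self, dim: int = 256, num_humans: int = 10, num_joints: int = 17):
        super().__init__()
        # One query token per human candidate, plus per-joint tokens.
        self.human_tokens = nn.Parameter(torch.randn(num_humans, dim))
        self.joint_tokens = nn.Parameter(torch.randn(num_joints, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.loc_head = nn.Linear(dim, 2)    # (x, y) human location
        self.joint_head = nn.Linear(dim, 3)  # 3D joint prediction

    def forward(self, feats: torch.Tensor):
        # feats: (batch, H*W, dim) flattened image features.
        b = feats.size(0)
        queries = torch.cat([self.human_tokens, self.joint_tokens], dim=0)
        queries = queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(queries, feats, feats)  # cross-attention
        n_h = self.human_tokens.size(0)
        human_locs = self.loc_head(out[:, :n_h])    # (batch, humans, 2)
        joint_preds = self.joint_head(out[:, n_h:]) # (batch, joints, 3)
        return human_locs, joint_preds
```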
arXiv Detail & Related papers (2024-03-26T17:59:23Z)
- Explicit Occlusion Reasoning for Multi-person 3D Human Pose Estimation [33.86986028882488]
Occlusion poses a great threat to monocular multi-person 3D human pose estimation due to large variability in terms of the shape, appearance, and position of occluders.
Existing methods try to handle occlusion with pose priors/constraints, data augmentation, or implicit reasoning.
We develop a method that explicitly models the occlusion process and significantly improves bottom-up multi-person human pose estimation.
arXiv Detail & Related papers (2022-07-29T22:12:50Z)
- Dual networks based 3D Multi-Person Pose Estimation from Monocular Video [42.01876518017639]
Multi-person 3D pose estimation is more challenging than single-person pose estimation.
Existing top-down approaches suffer from human detection errors, while bottom-up approaches are sensitive to scale variations.
We propose the integration of top-down and bottom-up approaches to exploit their strengths.
arXiv Detail & Related papers (2022-05-02T08:53:38Z)
- Non-Local Latent Relation Distillation for Self-Adaptive 3D Human Pose Estimation [63.199549837604444]
3D human pose estimation approaches leverage different forms of strong (2D/3D pose) or weak (multi-view or depth) paired supervision.
We cast 3D pose learning as a self-supervised adaptation problem that aims to transfer the task knowledge from a labeled source domain to a completely unpaired target.
We evaluate different self-adaptation settings and demonstrate state-of-the-art 3D human pose estimation performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-05T03:52:57Z)
- Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation [70.32536356351706]
We introduce MRP-Net, a common deep network backbone with two output heads that follow two diverse configurations.
We derive suitable measures to quantify prediction uncertainty at both pose and joint level.
We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
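As a hedged illustration of pose- and joint-level uncertainty from two diverse output heads, the sketch below uses disagreement between the heads as a simple uncertainty proxy; the head configurations and the disagreement measure are assumptions, not MRP-Net's actual formulation.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumption, not MRP-Net's actual design): a shared
# backbone with two differently configured heads; disagreement between
# the heads serves as a simple uncertainty proxy.
class TwoHeadPoseNet(nn.Module):
    def __init__(self, in_dim: int = 2048, num_joints: int = 17):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
        # Two diverse head configurations (depths differ here).
        self.head_a = nn.Linear(512, num_joints * 3)
        self.head_b = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, num_joints * 3)
        )
        self.num_joints = num_joints

    def forward(self, x: torch.Tensor):
        f = self.backbone(x)
        pose_a = self.head_a(f).view(-1, self.num_joints, 3)
        pose_b = self.head_b(f).view(-1, self.num_joints, 3)
        # Joint-level uncertainty: per-joint distance between heads.
        joint_unc = (pose_a - pose_b).norm(dim=-1)  # (batch, J)
        # Pose-level uncertainty: mean disagreement over all joints.
        pose_unc = joint_unc.mean(dim=-1)           # (batch,)
        return 0.5 * (pose_a + pose_b), joint_unc, pose_unc
```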
arXiv Detail & Related papers (2022-03-29T07:14:58Z)
- Perceiving Humans: from Monocular 3D Localization to Social Distancing [93.03056743850141]
We present a new cost-effective vision-based method that perceives humans' locations in 3D and their body orientation from a single image.
We show that it is possible to rethink the concept of "social distancing" as a form of social interaction in contrast to a simple location-based rule.
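As an illustration of an interaction-based rule rather than a purely location-based one, the sketch below flags a social interaction only when two people are both close in 3D and roughly facing each other; the thresholds and the facing test are assumptions, not the paper's method.

```python
import numpy as np

# Illustrative sketch (not the paper's code): flag "social interaction"
# using both 3D distance and body orientation, rather than a pure
# distance rule. Thresholds and the facing test are assumptions.
def interacting(loc_a, loc_b, yaw_a, yaw_b,
                max_dist=2.0, facing_tol=np.radians(45)):
    """loc_*: (x, z) ground-plane positions in meters;
    yaw_*: body orientation angles in radians."""
    diff = np.asarray(loc_b, dtype=float) - np.asarray(loc_a, dtype=float)
    dist = np.linalg.norm(diff)
    if dist > max_dist:
        return False  # too far apart to count as interacting
    # Angle from each person's facing direction to the other person,
    # wrapped to (-pi, pi] via the complex-exponential trick.
    ang_ab = np.arctan2(diff[1], diff[0])
    ang_ba = np.arctan2(-diff[1], -diff[0])
    facing_a = abs(np.angle(np.exp(1j * (ang_ab - yaw_a)))) < facing_tol
    facing_b = abs(np.angle(np.exp(1j * (ang_ba - yaw_b)))) < facing_tol
    # Interaction = close AND roughly mutually facing.
    return facing_a and facing_b

# Two people 1.5 m apart, facing each other along the x-axis.
print(interacting((0, 0), (1.5, 0), yaw_a=0.0, yaw_b=np.pi))  # True
```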
arXiv Detail & Related papers (2020-09-01T10:12:30Z)
- Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-View Geometry [62.29762409558553]
Epipolar constraints are at the core of feature matching and depth estimation in multi-person 3D human pose estimation methods.
While this formulation performs well in sparser crowd scenes, its effectiveness degrades in denser crowds.
In this paper, we depart from the multi-person 3D pose estimation formulation, and instead reformulate it as crowd pose estimation.
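For context, the epipolar constraint underlying such feature matching is the textbook relation x2^T F x1 = 0 for corresponding points in two calibrated views, with F the fundamental matrix. The sketch below scores candidate joint matches by their symmetric epipolar error; the variable names and the placeholder F are assumptions, not code from the paper.

```python
import numpy as np

# Textbook epipolar check (not code from the paper): score how well a
# 2D joint in view 1 matches a candidate joint in view 2 using the
# fundamental matrix F between the two calibrated cameras.
def epipolar_error(x1: np.ndarray, x2: np.ndarray, F: np.ndarray) -> float:
    """Symmetric distance of x1 and x2 (pixel coords) to each other's
    epipolar lines; near zero for a correct match."""
    p1 = np.array([x1[0], x1[1], 1.0])  # homogeneous coordinates
    p2 = np.array([x2[0], x2[1], 1.0])
    l2 = F @ p1    # epipolar line of x1 in view 2: l2 . p2 = 0
    l1 = F.T @ p2  # epipolar line of x2 in view 1
    d2 = abs(p2 @ l2) / np.hypot(l2[0], l2[1])
    d1 = abs(p1 @ l1) / np.hypot(l1[0], l1[1])
    return 0.5 * (d1 + d2)

# Usage: pick, for a joint in view 1, the best-matching joint in view 2.
# F would come from camera calibration; np.eye(3) is only a placeholder.
F = np.eye(3)
joint_v1 = np.array([320.0, 240.0])
candidates_v2 = [np.array([310.0, 250.0]), np.array([500.0, 100.0])]
best = min(candidates_v2, key=lambda c: epipolar_error(joint_v1, c, F))
```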
arXiv Detail & Related papers (2020-07-21T17:59:36Z)
- Coherent Reconstruction of Multiple Humans from a Single Image [68.3319089392548]
In this work, we address the problem of multi-person 3D pose estimation from a single image.
A typical regression approach in the top-down setting of this problem would first detect all humans and then reconstruct each one of them independently.
Our goal is to train a single network that learns to avoid these problems and generate a coherent 3D reconstruction of all the humans in the scene.
arXiv Detail & Related papers (2020-06-15T17:51:45Z)