3D Human Pose Estimation using Spatio-Temporal Networks with Explicit
Occlusion Training
- URL: http://arxiv.org/abs/2004.11822v1
- Date: Tue, 7 Apr 2020 09:12:12 GMT
- Title: 3D Human Pose Estimation using Spatio-Temporal Networks with Explicit
Occlusion Training
- Authors: Yu Cheng, Bo Yang, Bo Wang, Robby T. Tan
- Abstract summary: Estimating 3D poses from a monocular video is still a challenging task, despite the significant progress that has been made in recent years.
We introduce a spatio-temporal network for robust 3D human pose estimation.
We apply multi-scale spatial features for 2D joints or keypoints prediction in each individual frame, and multi-stride temporal convolutional networks (TCNs) to estimate 3D joints or keypoints.
- Score: 40.933783830017035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating 3D poses from a monocular video is still a challenging task,
despite the significant progress that has been made in recent years. Generally,
the performance of existing methods drops when the target person is too
small/large, or the motion is too fast/slow relative to the scale and speed of
the training data. Moreover, to our knowledge, many of these methods are not
designed or trained under severe occlusion explicitly, making their performance
on handling occlusion compromised. Addressing these problems, we introduce a
spatio-temporal network for robust 3D human pose estimation. As humans in
videos may appear in different scales and have various motion speeds, we apply
multi-scale spatial features for 2D joints or keypoints prediction in each
individual frame, and multi-stride temporal convolutional networks (TCNs) to
estimate 3D joints or keypoints. Furthermore, we design a spatio-temporal
discriminator based on body structures as well as limb motions to assess
whether the predicted pose forms a valid pose and a valid movement. During
training, we explicitly mask out some keypoints to simulate various occlusion
cases, from minor to severe occlusion, so that our network can learn better and
becomes robust to various degrees of occlusion. As there are limited 3D
ground-truth data, we further utilize 2D video data to inject a semi-supervised
learning capability to our network. Experiments on public datasets validate the
effectiveness of our method, and our ablation studies show the strengths of our
network's individual submodules.
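The occlusion training described in the abstract (masking out some keypoints during training, from minor to severe occlusion) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_keypoints`, the `(x, y, confidence)` keypoint layout, and the choice to zero out each keypoint independently with a fixed probability are all assumptions made here for illustration.

```python
import random
from typing import List, Optional, Tuple

# Assumed layout for one 2D keypoint: (x, y, confidence).
Keypoint = Tuple[float, float, float]


def mask_keypoints(
    frames: List[List[Keypoint]],
    occlusion_prob: float,
    rng: Optional[random.Random] = None,
) -> List[List[Keypoint]]:
    """Return a copy of `frames` where each keypoint is independently
    replaced by (0, 0, 0) with probability `occlusion_prob`.

    A hypothetical augmentation: a low probability simulates minor
    occlusion, a high probability simulates severe occlusion.
    """
    if not 0.0 <= occlusion_prob <= 1.0:
        raise ValueError("occlusion_prob must be in [0, 1]")
    if rng is None:
        rng = random.Random()
    masked = []
    for frame in frames:
        masked.append([
            (0.0, 0.0, 0.0) if rng.random() < occlusion_prob else kp
            for kp in frame
        ])
    return masked


# A toy clip: 4 frames, 17 joints each, all fully visible.
clip = [[(float(j), float(t), 1.0) for j in range(17)] for t in range(4)]

minor = mask_keypoints(clip, 0.1, random.Random(0))   # few joints hidden
severe = mask_keypoints(clip, 0.6, random.Random(0))  # most joints hidden
```

Sweeping `occlusion_prob` across training batches is one way to expose a network to a range of occlusion levels. The paper may use a different masking scheme, for example masking whole body parts or consecutive frames.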
Related papers
- Occlusion Resilient 3D Human Pose Estimation [52.49366182230432]
Occlusions remain one of the key challenges in 3D body pose estimation from single-camera video sequences.
We demonstrate the effectiveness of this approach compared to state-of-the-art techniques that infer poses from single-camera sequences.
arXiv Detail & Related papers (2024-02-16T19:29:43Z)
- STRIDE: Single-video based Temporally Continuous Occlusion Robust 3D Pose Estimation [27.854074900345314]
We propose STRIDE, a novel Test-Time Training (TTT) approach to fit a human motion prior to each video.
Our framework demonstrates flexibility by being model-agnostic, allowing us to use any off-the-shelf 3D pose estimation method for improving robustness and temporal consistency.
We validate STRIDE's efficacy through comprehensive experiments on challenging datasets like Occluded Human3.6M, Human3.6M, and OCMotion.
arXiv Detail & Related papers (2023-12-24T11:05:10Z)
- Decanus to Legatus: Synthetic training for 2D-3D human pose lifting [26.108023246654646]
We propose an algorithm to generate infinite 3D synthetic human poses (Legatus) from a 3D pose distribution based on 10 initial handcrafted 3D poses (Decanus).
Our results show that we can achieve 3D pose estimation performance comparable to methods using real data from specialized datasets but in a zero-shot setup, showing the potential of our framework.
arXiv Detail & Related papers (2022-10-05T13:10:19Z)
- Occluded Human Body Capture with Self-Supervised Spatial-Temporal Motion Prior [7.157324258813676]
We build the first 3D occluded motion dataset (OcMotion), which can be used for both training and testing.
A spatial-temporal layer is then designed to learn joint-level correlations.
Experimental results show that our method can generate accurate and coherent human motions from occluded videos with good generalization ability and runtime efficiency.
arXiv Detail & Related papers (2022-07-12T08:15:11Z)
- On Triangulation as a Form of Self-Supervision for 3D Human Pose Estimation [57.766049538913926]
Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant.
Much of the recent attention has shifted towards semi- and/or weakly supervised learning.
We propose to impose multi-view geometrical constraints by means of a differentiable triangulation and to use it as a form of self-supervision during training when no labels are available.
arXiv Detail & Related papers (2022-03-29T19:11:54Z)
- PONet: Robust 3D Human Pose Estimation via Learning Orientations Only [116.1502793612437]
We propose a novel Pose Orientation Net (PONet) that is able to robustly estimate 3D pose by learning orientations only.
PONet estimates the 3D orientation of these limbs by taking advantage of the local image evidence to recover the 3D pose.
We evaluate our method on multiple datasets, including Human3.6M, MPII, MPI-INF-3DHP, and 3DPW.
arXiv Detail & Related papers (2021-12-21T12:48:48Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate 3D mesh of multiple body parts with large-scale differences from a single RGB image.
The main challenge is lacking training data that have complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
- Multi-Scale Networks for 3D Human Pose Estimation with Inference Stage Optimization [33.02708860641971]
Estimating 3D human poses from a monocular video is still a challenging task.
The performance of many existing methods drops when the target person is occluded by other objects, or the motion is too fast/slow relative to the scale and speed of the training data.
We introduce a spatio-temporal network for robust 3D human pose estimation.
arXiv Detail & Related papers (2020-10-13T15:24:28Z)
- Unsupervised Cross-Modal Alignment for Multi-Person 3D Pose Estimation [52.94078950641959]
We present a deployment-friendly, fast bottom-up framework for multi-person 3D human pose estimation.
We adopt a novel neural representation of multi-person 3D pose which unifies the position of person instances with their corresponding 3D pose representation.
We propose a practical deployment paradigm where paired 2D or 3D pose annotations are unavailable.
arXiv Detail & Related papers (2020-08-04T07:54:25Z)
- A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in the human skeleton by modeling local and global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)