STRIDE: Single-video based Temporally Continuous Occlusion Robust 3D Pose Estimation
- URL: http://arxiv.org/abs/2312.16221v2
- Date: Thu, 14 Mar 2024 03:36:00 GMT
- Title: STRIDE: Single-video based Temporally Continuous Occlusion Robust 3D Pose Estimation
- Authors: Rohit Lal, Saketh Bachu, Yash Garg, Arindam Dutta, Calvin-Khang Ta, Dripta S. Raychaudhuri, Hannah Dela Cruz, M. Salman Asif, Amit K. Roy-Chowdhury,
- Abstract summary: We propose STRIDE, a novel Test-Time Training (TTT) approach to fit a human motion prior to each video.
Our framework demonstrates flexibility by being model-agnostic, allowing us to use any off-the-shelf 3D pose estimation method for improving robustness and temporal consistency.
We validate STRIDE's efficacy through comprehensive experiments on challenging datasets like Occluded Human3.6M, Human3.6M, and OCMotion.
- Score: 27.854074900345314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The capability to accurately estimate 3D human poses is crucial for diverse fields such as action recognition, gait recognition, and virtual/augmented reality. However, a persistent and significant challenge within this field is the accurate prediction of human poses under conditions of severe occlusion. Traditional image-based estimators struggle with heavy occlusions due to a lack of temporal context, resulting in inconsistent predictions. While video-based models benefit from processing temporal data, they encounter limitations when faced with prolonged occlusions that extend over multiple frames. This challenge arises because these models struggle to generalize beyond their training datasets, and the variety of occlusions is hard to capture in the training data. Addressing these challenges, we propose STRIDE (Single-video based TempoRally contInuous occlusion Robust 3D Pose Estimation), a novel Test-Time Training (TTT) approach to fit a human motion prior for each video. This approach specifically handles occlusions that were not encountered during the model's training. By employing STRIDE, we can refine a sequence of noisy initial pose estimates into accurate, temporally coherent poses during test time, effectively overcoming the limitations of prior methods. Our framework demonstrates flexibility by being model-agnostic, allowing us to use any off-the-shelf 3D pose estimation method for improving robustness and temporal consistency. We validate STRIDE's efficacy through comprehensive experiments on challenging datasets like Occluded Human3.6M, Human3.6M, and OCMotion, where it not only outperforms existing single-image and video-based pose estimation models but also showcases superior handling of substantial occlusions, achieving fast, robust, accurate, and temporally consistent 3D pose estimates.
Related papers
- Enhanced Spatio-Temporal Context for Temporally Consistent Robust 3D
Human Motion Recovery from Monocular Videos [5.258814754543826]
We propose a novel method for temporally consistent motion estimation from a monocular video.
Instead of using generic ResNet-like features, our method uses a body-aware feature representation and an independent per-frame pose.
Our method attains significantly lower acceleration error and outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2023-11-20T10:53:59Z) - Live Stream Temporally Embedded 3D Human Body Pose and Shape Estimation [13.40702053084305]
We present a temporally embedded 3D human body pose and shape estimation (TePose) method to improve the accuracy and temporal consistency pose in live stream videos.
A multi-scale convolutional network is presented as the motion discriminator for adversarial training using datasets without any 3D labeling.
arXiv Detail & Related papers (2022-07-25T21:21:59Z) - On Triangulation as a Form of Self-Supervision for 3D Human Pose
Estimation [57.766049538913926]
Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant.
Much of the recent attention has shifted towards semi and (or) weakly supervised learning.
We propose to impose multi-view geometrical constraints by means of a differentiable triangulation and to use it as form of self-supervision during training when no labels are available.
arXiv Detail & Related papers (2022-03-29T19:11:54Z) - Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose
Estimation [70.32536356351706]
We introduce MRP-Net that constitutes a common deep network backbone with two output heads subscribing to two diverse configurations.
We derive suitable measures to quantify prediction uncertainty at both pose and joint level.
We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2022-03-29T07:14:58Z) - Learning Dynamics via Graph Neural Networks for Human Pose Estimation
and Tracking [98.91894395941766]
We propose a novel online approach to learning the pose dynamics, which are independent of pose detections in current fame.
Specifically, we derive this prediction of dynamics through a graph neural network(GNN) that explicitly accounts for both spatial-temporal and visual information.
Experiments on PoseTrack 2017 and PoseTrack 2018 datasets demonstrate that the proposed method achieves results superior to the state of the art on both human pose estimation and tracking tasks.
arXiv Detail & Related papers (2021-06-07T16:36:50Z) - Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z) - Deep Dual Consecutive Network for Human Pose Estimation [44.41818683253614]
We propose a novel multi-frame human pose estimation framework, leveraging abundant temporal cues between video frames to facilitate keypoint detection.
Our method ranks No.1 in the Multi-frame Person Pose Challenge Challenge on the large-scale benchmark datasets PoseTrack 2017 and PoseTrack 2018.
arXiv Detail & Related papers (2021-03-12T13:11:27Z) - Multi-Scale Networks for 3D Human Pose Estimation with Inference Stage
Optimization [33.02708860641971]
Estimating 3D human poses from a monocular video is still a challenging task.
Many existing methods drop when the target person is cluded by other objects, or the motion is too fast/slow relative to the scale and speed of the training data.
We introduce atemporal-temporal network for robust 3D human pose estimation.
arXiv Detail & Related papers (2020-10-13T15:24:28Z) - Kinematic-Structure-Preserved Representation for Unsupervised 3D Human
Pose Estimation [58.72192168935338]
Generalizability of human pose estimation models developed using supervision on large-scale in-studio datasets remains questionable.
We propose a novel kinematic-structure-preserved unsupervised 3D pose estimation framework, which is not restrained by any paired or unpaired weak supervisions.
Our proposed model employs three consecutive differentiable transformations named as forward-kinematics, camera-projection and spatial-map transformation.
arXiv Detail & Related papers (2020-06-24T23:56:33Z) - Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image
Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z) - 3D Human Pose Estimation using Spatio-Temporal Networks with Explicit
Occlusion Training [40.933783830017035]
Estimating 3D poses from a monocular task is still a challenging task, despite the significant progress that has been made in recent years.
We introduce a-temporal video network for robust 3D human pose estimation.
We apply multi-scale spatial features for 2D joints or keypoints prediction in each individual frame, and multistride temporal convolutional net-works (TCNs) to estimate 3D joints or keypoints.
arXiv Detail & Related papers (2020-04-07T09:12:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.