Enhanced Spatio-Temporal Context for Temporally Consistent Robust 3D
Human Motion Recovery from Monocular Videos
- URL: http://arxiv.org/abs/2311.11662v1
- Date: Mon, 20 Nov 2023 10:53:59 GMT
- Title: Enhanced Spatio-Temporal Context for Temporally Consistent Robust 3D
Human Motion Recovery from Monocular Videos
- Authors: Sushovan Chanda and Amogh Tiwari and Lokender Tiwari and Brojeshwar
Bhowmick and Avinash Sharma and Hrishav Barua
- Abstract summary: We propose a novel method for temporally consistent motion estimation from a monocular video.
Instead of using generic ResNet-like features, our method uses a body-aware feature representation and an independent per-frame pose and camera initialization.
Our method attains significantly lower acceleration error and outperforms the existing state-of-the-art methods.
- Score: 5.258814754543826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recovering temporally consistent 3D human body pose, shape and motion from a
monocular video is a challenging task due to (self-)occlusions, poor lighting
conditions, complex articulated body poses, depth ambiguity, and limited
availability of annotated data. Further, a simple per-frame estimation is
insufficient, as it leads to jittery and implausible results. In this paper, we
propose a novel method for temporally consistent motion estimation from a
monocular video. Instead of using generic ResNet-like features, our method uses
a body-aware feature representation and an independent per-frame pose and
camera initialization over a temporal window followed by a novel
spatio-temporal feature aggregation using a combination of self-similarity
and self-attention over the body-aware features and the per-frame
initialization. Together, these yield an enhanced spatio-temporal context for
every frame by considering the remaining past and future frames in the
window. These features are used
to predict the pose and shape parameters of the human body model, which are
further refined using an LSTM. Experimental results on the publicly available
benchmark data show that our method attains significantly lower acceleration
error and outperforms the existing state-of-the-art methods on all key
quantitative evaluation metrics, even in challenging scenarios such as partial
occlusion, complex poses, and relatively low illumination.
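The aggregation step described in the abstract — combining self-similarity and self-attention over body-aware per-frame features to build temporal context — can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the function names, the use of dot-product similarity, and the scaled-softmax weighting are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_temporal_context(frame_feats):
    """Aggregate per-frame features over a temporal window using the
    frames' self-similarity as attention weights (illustrative sketch)."""
    # frame_feats: (T, D) — one body-aware feature vector per frame
    sim = frame_feats @ frame_feats.T                     # (T, T) self-similarity
    attn = softmax(sim / np.sqrt(frame_feats.shape[1]))   # scaled attention weights
    return attn @ frame_feats                             # (T, D) context-enhanced features

T, D = 5, 8  # small temporal window for demonstration
rng = np.random.default_rng(0)
feats = rng.standard_normal((T, D))
ctx = aggregate_temporal_context(feats)
print(ctx.shape)  # (5, 8)
```

Each output row is a weighted mixture of all frames in the window, so every frame's representation absorbs context from the remaining past and future frames; in the paper, these aggregated features are then decoded into body-model parameters and refined with an LSTM.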
Related papers
- S3O: A Dual-Phase Approach for Reconstructing Dynamic Shape and Skeleton of Articulated Objects from Single Monocular Video [13.510513575340106]
Reconstructing dynamic articulated objects from a single monocular video is challenging, requiring joint estimation of shape, motion, and camera parameters from limited views.
We propose Synergistic Shape and Skeleton Optimization (S3O), a novel two-phase method that efficiently learns parametric models including visible shapes and underlying skeletons.
Our experimental evaluations on standard benchmarks and the PlanetZoo dataset affirm that S3O provides more accurate 3D reconstruction and plausible skeletons, and reduces the training time by approximately 60% compared to the state of the art.
arXiv Detail & Related papers (2024-05-21T09:01:00Z)
- STRIDE: Single-video based Temporally Continuous Occlusion Robust 3D Pose Estimation [27.854074900345314]
We propose STRIDE, a novel Test-Time Training (TTT) approach to fit a human motion prior to each video.
Our framework demonstrates flexibility by being model-agnostic, allowing us to use any off-the-shelf 3D pose estimation method for improving robustness and temporal consistency.
We validate STRIDE's efficacy through comprehensive experiments on challenging datasets like Occluded Human3.6M, Human3.6M, and OCMotion.
arXiv Detail & Related papers (2023-12-24T11:05:10Z)
- Exploiting Spatial-Temporal Context for Interacting Hand Reconstruction on Monocular RGB Video [104.69686024776396]
Reconstructing interacting hands from monocular RGB data is a challenging task, as it involves many interfering factors.
Previous works only leverage information from a single RGB image without modeling their physically plausible relation.
In this work, we are dedicated to explicitly exploiting spatial-temporal information to achieve better interacting hand reconstruction.
arXiv Detail & Related papers (2023-08-08T06:16:37Z)
- SST: Real-time End-to-end Monocular 3D Reconstruction via Sparse Spatial-Temporal Guidance [71.3027345302485]
Real-time monocular 3D reconstruction is a challenging problem that remains unsolved.
We propose an end-to-end 3D reconstruction network SST, which utilizes Sparse estimated points from visual SLAM system.
SST outperforms all state-of-the-art competitors, whilst keeping a high inference speed at 59 FPS.
arXiv Detail & Related papers (2022-12-13T12:17:13Z)
- A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation of TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z)
- Imposing Temporal Consistency on Deep Monocular Body Shape and Pose Estimation [67.23327074124855]
This paper presents an elegant solution for the integration of temporal constraints in the fitting process.
We derive parameters of a sequence of body models, representing shape and motion of a person, including jaw poses, facial expressions, and finger poses.
Our approach enables the derivation of realistic 3D body models from image sequences, including facial expression and articulated hands.
arXiv Detail & Related papers (2022-02-07T11:11:55Z)
- Deep Dual Consecutive Network for Human Pose Estimation [44.41818683253614]
We propose a novel multi-frame human pose estimation framework, leveraging abundant temporal cues between video frames to facilitate keypoint detection.
Our method ranks No.1 in the Multi-frame Person Pose Challenge on the large-scale benchmark datasets PoseTrack 2017 and PoseTrack 2018.
arXiv Detail & Related papers (2021-03-12T13:11:27Z)
- A-NeRF: Surface-free Human 3D Pose Refinement via Neural Rendering [13.219688351773422]
We propose a test-time optimization approach for monocular motion capture that learns a volumetric body model of the user in a self-supervised manner.
Our approach is self-supervised and does not require any additional ground truth labels for appearance, pose, or 3D shape.
We demonstrate that our novel combination of a discriminative pose estimation technique with surface-free analysis-by-synthesis outperforms purely discriminative monocular pose estimation approaches.
arXiv Detail & Related papers (2021-02-11T18:58:31Z)
- Monocular Real-time Full Body Capture with Inter-part Correlations [66.22835689189237]
We present the first method for real-time full body capture that estimates shape and motion of body and hands together with a dynamic 3D face model from a single color image.
Our approach uses a new neural network architecture that exploits correlations between body and hands at high computational efficiency.
arXiv Detail & Related papers (2020-12-11T02:37:56Z)
- Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction [118.21363599332493]
We present a method to leverage photometric consistency across time when annotations are only available for a sparse subset of frames in a video.
Our model is trained end-to-end on color images to jointly reconstruct hands and objects in 3D by inferring their poses.
We achieve state-of-the-art results on 3D hand-object reconstruction benchmarks and demonstrate that our approach allows us to improve the pose estimation accuracy.
arXiv Detail & Related papers (2020-04-28T12:03:14Z)
- A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in the human skeleton by modeling local and global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.