HMP: Hand Motion Priors for Pose and Shape Estimation from Video
- URL: http://arxiv.org/abs/2312.16737v1
- Date: Wed, 27 Dec 2023 22:35:33 GMT
- Title: HMP: Hand Motion Priors for Pose and Shape Estimation from Video
- Authors: Enes Duran, Muhammed Kocabas, Vasileios Choutas, Zicong Fan and
Michael J. Black
- Abstract summary: We develop a generative motion prior specific for hands, trained on the AMASS dataset which features diverse and high-quality hand motions.
Our integration of a robust motion prior significantly enhances performance, especially in occluded scenarios.
We demonstrate our method's efficacy via qualitative and quantitative evaluations on the HO3D and DexYCB datasets.
- Score: 52.39020275278984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding how humans interact with the world necessitates accurate 3D
hand pose estimation, a task complicated by the hand's high degree of
articulation, frequent occlusions, self-occlusions, and rapid motions. While
most existing methods rely on single-image inputs, videos have useful cues to
address aforementioned issues. However, existing video-based 3D hand datasets
are insufficient for training feedforward models to generalize to in-the-wild
scenarios. On the other hand, we have access to large human motion capture
datasets which also include hand motions, e.g. AMASS. Therefore, we develop a
generative motion prior specific for hands, trained on the AMASS dataset which
features diverse and high-quality hand motions. This motion prior is then
employed for video-based 3D hand motion estimation following a latent
optimization approach. Our integration of a robust motion prior significantly
enhances performance, especially in occluded scenarios. It produces stable,
temporally consistent results that surpass conventional single-frame methods.
We demonstrate our method's efficacy via qualitative and quantitative
evaluations on the HO3D and DexYCB datasets, with special emphasis on an
occlusion-focused subset of HO3D. Code is available at
https://hmp.is.tue.mpg.de
Related papers
- Bundle Adjusted Gaussian Avatars Deblurring [31.718130377229482]
We propose a 3D-aware, physics-oriented model of blur formation attributable to human movement and a 3D human motion model to clarify ambiguities found in motion-induced blurry images.
We have established benchmarks for this task through a synthetic dataset derived from existing multi-view captures, alongside a real-captured dataset acquired through a 360-degree synchronous hybrid-exposure camera system.
arXiv Detail & Related papers (2024-11-24T10:03:24Z) - Denoising Diffusion for 3D Hand Pose Estimation from Images [38.20064386142944]
This paper addresses the problem of 3D hand pose estimation from monocular images or sequences.
We present a novel end-to-end framework for 3D hand regression that employs diffusion models that have shown excellent ability to capture the distribution of data for generative purposes.
The proposed model provides state-of-the-art performance when lifting a 2D single-hand image to 3D.
arXiv Detail & Related papers (2023-08-18T12:57:22Z) - Scene-Aware 3D Multi-Human Motion Capture from a Single Camera [83.06768487435818]
We consider the problem of estimating the 3D position of multiple humans in a scene as well as their body shape and articulation from a single RGB video recorded with a static camera.
We leverage recent advances in computer vision using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks.
In particular, we estimate the scene depth and unique person scale from normalized disparity predictions using the 2D body joints and joint angles.
arXiv Detail & Related papers (2023-01-12T18:01:28Z) - Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape
Estimation from Monocular Video [24.217269857183233]
We propose a motion pose and shape network (MPS-Net) to capture humans in motion to estimate 3D human pose and shape from a video.
Specifically, we first propose a motion continuity attention (MoCA) module that leverages visual cues observed from human motion to adaptively recalibrate the range that needs attention in the sequence.
By coupling the MoCA and HAFI modules, the proposed MPS-Net excels in estimating 3D human pose and shape in the video.
arXiv Detail & Related papers (2022-03-16T11:00:24Z) - Estimating 3D Motion and Forces of Human-Object Interactions from
Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video.
Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z) - Render In-between: Motion Guided Video Synthesis for Action
Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to inference the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline only requires low-frame-rate videos and unpaired human motion data but does not require high-frame-rate videos for training.
arXiv Detail & Related papers (2021-11-01T15:32:51Z) - Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z) - Body2Hands: Learning to Infer 3D Hands from Conversational Gesture Body
Dynamics [87.17505994436308]
We build upon the insight that body motion and hand gestures are strongly correlated in non-verbal communication settings.
We formulate the learning of this prior as a prediction task of 3D hand shape over time given body motion input alone.
Our hand prediction model produces convincing 3D hand gestures given only the 3D motion of the speaker's arms as input.
arXiv Detail & Related papers (2020-07-23T22:58:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.