HMP: Hand Motion Priors for Pose and Shape Estimation from Video
- URL: http://arxiv.org/abs/2312.16737v1
- Date: Wed, 27 Dec 2023 22:35:33 GMT
- Title: HMP: Hand Motion Priors for Pose and Shape Estimation from Video
- Authors: Enes Duran, Muhammed Kocabas, Vasileios Choutas, Zicong Fan and Michael J. Black
- Abstract summary: We develop a generative motion prior specific for hands, trained on the AMASS dataset which features diverse and high-quality hand motions.
Our integration of a robust motion prior significantly enhances performance, especially in occluded scenarios.
We demonstrate our method's efficacy via qualitative and quantitative evaluations on the HO3D and DexYCB datasets.
- Score: 52.39020275278984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding how humans interact with the world necessitates accurate 3D
hand pose estimation, a task complicated by the hand's high degree of
articulation, frequent occlusions, self-occlusions, and rapid motions. While
most existing methods rely on single-image inputs, videos offer useful cues to
address the aforementioned issues. However, existing video-based 3D hand datasets
are insufficient for training feedforward models to generalize to in-the-wild
scenarios. On the other hand, we have access to large human motion capture
datasets which also include hand motions, e.g. AMASS. Therefore, we develop a
generative motion prior specific for hands, trained on the AMASS dataset which
features diverse and high-quality hand motions. This motion prior is then
employed for video-based 3D hand motion estimation following a latent
optimization approach. Our integration of a robust motion prior significantly
enhances performance, especially in occluded scenarios. It produces stable,
temporally consistent results that surpass conventional single-frame methods.
We demonstrate our method's efficacy via qualitative and quantitative
evaluations on the HO3D and DexYCB datasets, with special emphasis on an
occlusion-focused subset of HO3D. Code is available at
https://hmp.is.tue.mpg.de
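
The latent optimization idea described in the abstract can be illustrated with a deliberately small sketch. This is not the HMP implementation: the learned generative prior is replaced by a fixed cosine basis, the hand motion by a 1-D trajectory, and every name below (`cosine_basis`, `decode`, `fit_latent`, `run_demo`) is hypothetical. It only shows the general pattern of fitting a latent code to confidence-weighted observations so that occluded frames are filled in by the prior rather than by per-frame estimates.

```python
import math
import random


def cosine_basis(num_frames, latent_dim):
    """Smooth basis functions; a toy stand-in for a learned motion decoder."""
    return [
        [math.cos(math.pi * k * (t + 0.5) / num_frames) for t in range(num_frames)]
        for k in range(latent_dim)
    ]


def decode(z, basis):
    """Map a latent code z to a 1-D trajectory (one value per frame)."""
    num_frames = len(basis[0])
    return [sum(zk * bk[t] for zk, bk in zip(z, basis)) for t in range(num_frames)]


def fit_latent(obs, conf, basis, prior_weight=1e-3, lr=0.2, steps=1000):
    """Minimize mean_t conf_t * (decode(z)_t - obs_t)^2 + prior_weight * ||z||^2."""
    num_frames = len(obs)
    z = [0.0] * len(basis)
    for _ in range(steps):
        pred = decode(z, basis)
        # Confidence-weighted residuals: occluded frames (conf = 0) are ignored.
        resid = [c * (p - o) for p, o, c in zip(pred, obs, conf)]
        grad = [
            2.0 * sum(r * b for r, b in zip(resid, bk)) / num_frames
            + 2.0 * prior_weight * zk
            for zk, bk in zip(z, basis)
        ]
        z = [zk - lr * g for zk, g in zip(z, grad)]
    return z


def run_demo(num_frames=40, seed=0):
    """Return mean abs error on occluded frames: (per-frame baseline, latent fit)."""
    rng = random.Random(seed)
    basis = cosine_basis(num_frames, latent_dim=4)
    truth = decode([0.5, 1.0, -0.6, 0.3], basis)
    occluded = range(15, 25)
    conf = [0.0 if t in occluded else 1.0 for t in range(num_frames)]
    # Visible frames get small noise; occluded frames carry no usable signal.
    obs = [x + rng.gauss(0.0, 0.02) if c else 0.0 for x, c in zip(truth, conf)]
    fitted = decode(fit_latent(obs, conf, basis), basis)
    baseline = sum(abs(obs[t] - truth[t]) for t in occluded) / len(occluded)
    latent = sum(abs(fitted[t] - truth[t]) for t in occluded) / len(occluded)
    return baseline, latent
```

Calling `run_demo()` returns the mean error on the occluded frames for the raw per-frame observations and for the latent fit. In this toy setup the ground truth lies exactly in the span of the basis, so the comparison is far more favorable than anything one should expect from a learned prior on real video.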
Related papers
- HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation [60.2305990057581]
3D hand pose estimation is crucial for many human-computer interaction applications such as augmented reality.
HandMCM is a novel method based on the powerful state space model (Mamba).
arXiv Detail & Related papers (2026-02-02T03:25:43Z)
- SFHand: A Streaming Framework for Language-guided 3D Hand Forecasting and Embodied Manipulation [25.88676013839077]
SFHand is the first streaming framework for language-guided 3D hand forecasting.
SFHand autoregressively predicts a comprehensive set of future 3D hand states.
EgoHaFL is the first large-scale dataset featuring synchronized 3D hand poses and language instructions.
arXiv Detail & Related papers (2025-11-22T17:22:24Z)
- Object-Aware 4D Human Motion Generation [20.338809521456298]
We propose an object-aware 4D human motion generation framework grounded in 3D Gaussian representations and motion diffusion priors.
Our framework produces natural and physically plausible human motions that respect 3D spatial context.
arXiv Detail & Related papers (2025-10-31T20:40:17Z)
- AssemblyHands-X: Modeling 3D Hand-Body Coordination for Understanding Bimanual Human Activities [27.634829042887358]
We present AssemblyHands-X, the first markerless 3D hand-body benchmark for bimanual activities.
Our approach combines multi-view triangulation with SMPL-X mesh fitting, yielding reliable 3D registration of hands and upper body.
Our experiments show pose-based action inference is more efficient and accurate than video baselines.
arXiv Detail & Related papers (2025-09-28T13:52:14Z)
- InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation [54.09384502044162]
We introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements.
First, we consolidate and standardize 21.81 hours of HOI data from diverse sources, enriching it with detailed textual annotations.
Second, we propose a unified optimization framework to enhance data quality by reducing artifacts and correcting hand motions.
Third, we define six benchmarking tasks and develop a unified HOI generative modeling perspective, achieving state-of-the-art performance.
arXiv Detail & Related papers (2025-09-11T15:43:54Z)
- JGHand: Joint-Driven Animatable Hand Avater via 3D Gaussian Splatting [3.1143479095236892]
Jointly 3D Gaussian Hand (JGHand) is a novel joint-driven 3D Gaussian Splatting (3DGS)-based hand representation.
We show that JGHand achieves real-time rendering speeds with enhanced quality, surpassing state-of-the-art methods.
arXiv Detail & Related papers (2025-01-31T12:33:24Z)
- Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera [49.82535393220003]
Dyn-HaMR is the first approach to reconstruct 4D global hand motion from monocular videos recorded by dynamic cameras in the wild.
We show that our approach significantly outperforms state-of-the-art methods in terms of 4D global mesh recovery.
This establishes a new benchmark for hand motion reconstruction from monocular video with moving cameras.
arXiv Detail & Related papers (2024-12-17T12:43:10Z)
- Bundle Adjusted Gaussian Avatars Deblurring [31.718130377229482]
We propose a 3D-aware, physics-oriented model of blur formation attributable to human movement and a 3D human motion model to clarify ambiguities found in motion-induced blurry images.
We have established benchmarks for this task through a synthetic dataset derived from existing multi-view captures, alongside a real-captured dataset acquired through a 360-degree synchronous hybrid-exposure camera system.
arXiv Detail & Related papers (2024-11-24T10:03:24Z)
- Denoising Diffusion for 3D Hand Pose Estimation from Images [38.20064386142944]
This paper addresses the problem of 3D hand pose estimation from monocular images or sequences.
We present a novel end-to-end framework for 3D hand regression that employs diffusion models that have shown excellent ability to capture the distribution of data for generative purposes.
The proposed model provides state-of-the-art performance when lifting a 2D single-hand image to 3D.
arXiv Detail & Related papers (2023-08-18T12:57:22Z)
- Scene-Aware 3D Multi-Human Motion Capture from a Single Camera [83.06768487435818]
We consider the problem of estimating the 3D position of multiple humans in a scene as well as their body shape and articulation from a single RGB video recorded with a static camera.
We leverage recent advances in computer vision using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks.
In particular, we estimate the scene depth and unique person scale from normalized disparity predictions using the 2D body joints and joint angles.
arXiv Detail & Related papers (2023-01-12T18:01:28Z)
- Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video [24.217269857183233]
We propose a motion pose and shape network (MPS-Net) to capture humans in motion to estimate 3D human pose and shape from a video.
Specifically, we first propose a motion continuity attention (MoCA) module that leverages visual cues observed from human motion to adaptively recalibrate the range that needs attention in the sequence.
By coupling the MoCA and HAFI modules, the proposed MPS-Net excels in estimating 3D human pose and shape in the video.
arXiv Detail & Related papers (2022-03-16T11:00:24Z)
- Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video.
Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z)
- Render In-between: Motion Guided Video Synthesis for Action Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to infer the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline only requires low-frame-rate videos and unpaired human motion data but does not require high-frame-rate videos for training.
arXiv Detail & Related papers (2021-11-01T15:32:51Z)
- Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z)
- Body2Hands: Learning to Infer 3D Hands from Conversational Gesture Body Dynamics [87.17505994436308]
We build upon the insight that body motion and hand gestures are strongly correlated in non-verbal communication settings.
We formulate the learning of this prior as a prediction task of 3D hand shape over time given body motion input alone.
Our hand prediction model produces convincing 3D hand gestures given only the 3D motion of the speaker's arms as input.
arXiv Detail & Related papers (2020-07-23T22:58:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.