On the Utility of 3D Hand Poses for Action Recognition
- URL: http://arxiv.org/abs/2403.09805v2
- Date: Wed, 14 Aug 2024 13:22:50 GMT
- Title: On the Utility of 3D Hand Poses for Action Recognition
- Authors: Md Salman Shamil, Dibyadip Chatterjee, Fadime Sener, Shugao Ma, Angela Yao,
- Abstract summary: HandFormer is a novel multimodal transformer to efficiently model hand-object interactions.
We factorize hand modeling and represent each joint by its short-term trajectories.
We achieve new state-of-the-art performance on Assembly101 and H2O with significant improvements in egocentric action recognition.
- Score: 36.64538554919222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D hand pose is an underexplored modality for action recognition. Poses are compact yet informative and can greatly benefit applications with limited compute budgets. However, poses alone offer an incomplete understanding of actions, as they cannot fully capture objects and environments with which humans interact. We propose HandFormer, a novel multimodal transformer, to efficiently model hand-object interactions. HandFormer combines 3D hand poses at a high temporal resolution for fine-grained motion modeling with sparsely sampled RGB frames for encoding scene semantics. Observing the unique characteristics of hand poses, we temporally factorize hand modeling and represent each joint by its short-term trajectories. This factorized pose representation combined with sparse RGB samples is remarkably efficient and highly accurate. Unimodal HandFormer with only hand poses outperforms existing skeleton-based methods at 5x fewer FLOPs. With RGB, we achieve new state-of-the-art performance on Assembly101 and H2O with significant improvements in egocentric action recognition.
Related papers
- HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud [60.47544798202017]
Hand pose estimation is a critical task in various human-computer interaction applications.
This paper proposes HandDiff, a diffusion-based hand pose estimation model that iteratively denoises accurate hand pose conditioned on hand-shaped image-point clouds.
Experimental results demonstrate that the proposed HandDiff significantly outperforms the existing approaches on four challenging hand pose benchmark datasets.
arXiv Detail & Related papers (2024-04-04T02:15:16Z) - Denoising Diffusion for 3D Hand Pose Estimation from Images [38.20064386142944]
This paper addresses the problem of 3D hand pose estimation from monocular images or sequences.
We present a novel end-to-end framework for 3D hand regression that employs diffusion models that have shown excellent ability to capture the distribution of data for generative purposes.
The proposed model provides state-of-the-art performance when lifting a 2D single-hand image to 3D.
arXiv Detail & Related papers (2023-08-18T12:57:22Z) - Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation [59.3035531612715]
Existing methods often struggle to generate plausible hand poses when the hand is heavily occluded or blurred.
In videos, the movements of the hand allow us to observe various parts of the hand that may be occluded or blurred in a single frame.
We propose the Deformer: a framework that implicitly reasons about the relationship between hand parts within the same image.
arXiv Detail & Related papers (2023-03-09T02:24:30Z) - 3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal [85.30756038989057]
Estimating 3D interacting hand pose from a single RGB image is essential for understanding human actions.
We propose to decompose the challenging interacting hand pose estimation task and estimate the pose of each hand separately.
Experiments show that the proposed method significantly outperforms previous state-of-the-art interacting hand pose estimation approaches.
arXiv Detail & Related papers (2022-07-22T13:04:06Z) - Monocular 3D Reconstruction of Interacting Hands via Collision-Aware
Factorized Refinements [96.40125818594952]
We make the first attempt to reconstruct 3D interacting hands from monocular single RGB images.
Our method can generate 3D hand meshes with both precise 3D poses and minimal collisions.
arXiv Detail & Related papers (2021-11-01T08:24:10Z) - SeqHAND:RGB-Sequence-Based 3D Hand Pose and Shape Estimation [48.456638103309544]
3D hand pose estimation based on RGB images has been studied for a long time.
We propose a novel method that generates a synthetic dataset that mimics natural human hand movements.
We show that utilizing temporal information for 3D hand pose estimation significantly enhances general pose estimations.
arXiv Detail & Related papers (2020-07-10T05:11:14Z) - Leveraging Photometric Consistency over Time for Sparsely Supervised
Hand-Object Reconstruction [118.21363599332493]
We present a method to leverage photometric consistency across time when annotations are only available for a sparse subset of frames in a video.
Our model is trained end-to-end on color images to jointly reconstruct hands and objects in 3D by inferring their poses.
We achieve state-of-the-art results on 3D hand-object reconstruction benchmarks and demonstrate that our approach allows us to improve the pose estimation accuracy.
arXiv Detail & Related papers (2020-04-28T12:03:14Z) - Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and
Objects for 3D Hand Pose Estimation under Hand-Object Interaction [137.28465645405655]
HANDS'19 is a challenge to evaluate the abilities of current 3D hand pose estimators (HPEs) to interpolate and extrapolate the poses of a training set.
We show that the accuracy of state-of-the-art methods can drop, and that they fail mostly on poses absent from the training set.
arXiv Detail & Related papers (2020-03-30T19:28:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.