Multi-View Video-Based 3D Hand Pose Estimation
- URL: http://arxiv.org/abs/2109.11747v1
- Date: Fri, 24 Sep 2021 05:20:41 GMT
- Title: Multi-View Video-Based 3D Hand Pose Estimation
- Authors: Leyla Khaleghi, Alireza Sepas Moghaddam, Joshua Marshall, Ali Etemad
- Abstract summary: We present the Multi-View Video-Based 3D Hand dataset, consisting of multi-view videos of the hand along with ground-truth 3D pose labels.
Our dataset includes more than 402,000 synthetic hand images available in 4,560 videos.
Next, we implement MuViHandNet, a neural pipeline consisting of image encoders for obtaining visual embeddings of the hand, recurrent learners to capture temporal and angular sequential information, and graph networks to estimate the final 3D pose.
- Score: 11.65577683784217
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Hand pose estimation (HPE) can be used for a variety of human-computer
interaction applications such as gesture-based control for physical or
virtual/augmented reality devices. Recent works have shown that videos or
multi-view images carry rich information regarding the hand, allowing for the
development of more robust HPE systems. In this paper, we present the
Multi-View Video-Based 3D Hand (MuViHand) dataset, consisting of multi-view
videos of the hand along with ground-truth 3D pose labels. Our dataset includes
more than 402,000 synthetic hand images available in 4,560 videos. The videos
have been simultaneously captured from six different angles with complex
backgrounds and random levels of dynamic lighting. The data has been captured
from 10 distinct animated subjects using 12 cameras in a semi-circle topology
where six tracking cameras only focus on the hand and the other six fixed
cameras capture the entire body. Next, we implement MuViHandNet, a neural
pipeline consisting of image encoders for obtaining visual embeddings of the
hand, recurrent learners to learn both temporal and angular sequential
information, and graph networks with U-Net architectures to estimate the final
3D pose information. We perform extensive experiments and show the challenging
nature of this new dataset as well as the effectiveness of our proposed method.
Ablation studies show the added value of each component in MuViHandNet, as well
as the benefit of having temporal and sequential information in the dataset.
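As a rough illustration of the MuViHandNet pipeline described in the abstract, the sketch below wires its three named stages together in PyTorch: an image encoder producing visual embeddings, a recurrent learner over the frame/view sequence, and a graph-style head regressing the 3D joints. The backbone choice, all layer sizes, the 21-joint output, and the plain linear layers standing in for the graph U-Net are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the three MuViHandNet stages named in the abstract.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class MuViHandNetSketch(nn.Module):
    def __init__(self, num_joints=21, embed_dim=512, hidden_dim=256):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()            # 512-d visual embedding per frame
        self.encoder = backbone
        # A single recurrent learner here; the paper describes recurrent
        # modules for both temporal and angular (cross-view) sequences.
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Stand-in for the graph U-Net: per-joint features, then xyz regression.
        self.to_joints = nn.Linear(hidden_dim, num_joints * 64)
        self.head = nn.Linear(64, 3)

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
        seq, _ = self.rnn(feats)               # sequential info across frames/views
        h = self.to_joints(seq[:, -1]).view(B, -1, 64)
        return self.head(h)                    # (B, num_joints, 3) 3D pose

poses = MuViHandNetSketch()(torch.randn(2, 6, 3, 224, 224))
print(poses.shape)                             # torch.Size([2, 21, 3])
```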
Related papers
- PIV3CAMS: a multi-camera dataset for multiple computer vision problems and its application to novel view-point synthesis [120.4361056355332]
This thesis introduces Paired Image and Video data from three CAMeraS, namely PIV3CAMS.
The PIV3CAMS dataset consists of 8385 pairs of images and 82 pairs of videos taken from three different cameras.
In addition to the regeneration of a current state-of-the-art algorithm, we investigate several proposed alternative models that integrate depth information geometrically.
arXiv Detail & Related papers (2024-07-26T12:18:29Z)
- HUP-3D: A 3D multi-view synthetic dataset for assisted-egocentric hand-ultrasound pose estimation [11.876066932162873]
HUP-3D is a 3D multiview synthetic dataset for hand-ultrasound probe pose estimation.
Our dataset consists of over 31k sets of movements.
Our approach builds on an image rendering concept that enhances diversity with various hand and arm textures.
arXiv Detail & Related papers (2024-07-12T12:25:42Z)
- HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction [16.363878619678367]
We introduce a new dataset named HO-Cap that can be used to study 3D reconstruction and pose tracking of hands and objects in videos.
We propose a semi-automatic method to obtain annotations of shape and pose of hands and objects in the collected videos.
Our data capture setup and annotation framework can be used by the community to reconstruct 3D shapes of objects and human hands and track their poses in videos.
arXiv Detail & Related papers (2024-06-10T23:25:19Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
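The autodecoder framing and the normalization/de-normalization operations mentioned above can be pictured with a minimal sketch: each training object owns a learnable latent code, a decoder maps it to a volumetric latent, and simple statistics rescale that latent space so a diffusion model can operate on it. Every shape, size, and the global mean/std statistics here are assumptions for illustration.

```python
# Minimal sketch of an autodecoder with latent normalization, assuming
# per-object learnable codes and simple global statistics.
import torch
import torch.nn as nn

class AutoDecoderSketch(nn.Module):
    def __init__(self, num_objects, code_dim=128, vol=8, ch=16):
        super().__init__()
        # One optimizable code per dataset object (no image encoder).
        self.codes = nn.Embedding(num_objects, code_dim)
        self.decode = nn.Linear(code_dim, ch * vol ** 3)
        self.ch, self.vol = ch, vol

    def forward(self, obj_ids):
        v = self.decode(self.codes(obj_ids))
        return v.view(-1, self.ch, self.vol, self.vol, self.vol)

def normalize(x, mean, std):       # map latents to a well-scaled range
    return (x - mean) / (std + 1e-6)

def denormalize(z, mean, std):     # inverse, applied after diffusion sampling
    return z * (std + 1e-6) + mean

model = AutoDecoderSketch(num_objects=10)
vols = model(torch.arange(4))
mean, std = vols.mean(), vols.std()
assert torch.allclose(denormalize(normalize(vols, mean, std), mean, std), vols, atol=1e-4)
```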
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Learning to Deblur and Rotate Motion-Blurred Faces [43.673660541417995]
We train a neural network to reconstruct a 3D video representation from a single image and the corresponding face gaze.
We then provide a camera viewpoint relative to the estimated gaze and the blurry image as input to an encoder-decoder network to generate a video of sharp frames with a novel camera viewpoint.
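Purely as an illustration of the conditioning described above (all shapes and the viewpoint encoding are assumptions, not the paper's design): an encoder-decoder takes the blurry image together with a camera-viewpoint vector and emits a short sequence of sharp frames.

```python
# Hypothetical encoder-decoder conditioned on a camera viewpoint, emitting
# a short video of sharp frames from one blurry input image.
import torch
import torch.nn as nn

class DeblurRotateSketch(nn.Module):
    def __init__(self, num_frames=7, view_dim=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.view_proj = nn.Linear(view_dim, 64)      # broadcast viewpoint code
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_frames * 3, 4, stride=2, padding=1),
        )
        self.num_frames = num_frames

    def forward(self, blurry, viewpoint):             # (B,3,H,W), (B,view_dim)
        f = self.encoder(blurry)
        f = f + self.view_proj(viewpoint)[:, :, None, None]
        video = self.decoder(f)                       # (B, T*3, H, W)
        B, _, H, W = video.shape
        return video.view(B, self.num_frames, 3, H, W)

frames = DeblurRotateSketch()(torch.randn(2, 3, 64, 64), torch.randn(2, 3))
```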
arXiv Detail & Related papers (2021-12-14T17:51:19Z)
- 4D-Net for Learned Multi-Modal Alignment [87.58354992455891]
We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time.
We are able to incorporate the 4D information by performing novel connection learning across various feature representations and levels of abstraction, as well as by observing geometric constraints.
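One way to picture the connection learning mentioned above (the gating mechanism here is an assumption, not 4D-Net's actual fusion design): features from the point-cloud and RGB towers are combined through learned, level-specific weights.

```python
# Hypothetical learned fusion between point-cloud and RGB feature levels,
# illustrating connection learning across representations and abstractions.
import torch
import torch.nn as nn

class ConnectionFusionSketch(nn.Module):
    def __init__(self, dims=(64, 128, 256)):
        super().__init__()
        # One learned gate per feature level, predicted from the RGB feature.
        self.gates = nn.ModuleList(nn.Linear(d, d) for d in dims)

    def forward(self, pc_feats, rgb_feats):
        # pc_feats / rgb_feats: lists of (B, d) features, one per level.
        fused = []
        for gate, pc, rgb in zip(self.gates, pc_feats, rgb_feats):
            w = torch.sigmoid(gate(rgb))       # how much RGB to mix in
            fused.append(pc + w * rgb)
        return fused

model = ConnectionFusionSketch()
pc = [torch.randn(2, d) for d in (64, 128, 256)]
rgb = [torch.randn(2, d) for d in (64, 128, 256)]
out = model(pc, rgb)
```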
arXiv Detail & Related papers (2021-09-02T16:35:00Z)
- D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z)
- Self-Supervised Multi-View Synchronization Learning for 3D Pose Estimation [39.334995719523]
Current methods cast monocular 3D human pose estimation as a learning problem by training neural networks on large data sets of images and corresponding skeleton poses.
We propose an approach that can exploit small annotated data sets by fine-tuning networks pre-trained via self-supervised learning on (large) unlabeled data sets.
We demonstrate the effectiveness of the synchronization task on the Human3.6M data set and achieve state-of-the-art results in 3D human pose estimation.
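As a hedged sketch of how such a synchronization pretext task could look (the task framing comes from the summary above; the architecture here is an assumption): a shared encoder embeds two views and a small head classifies whether they are temporally synchronized, so labels come for free from multi-view footage.

```python
# Sketch of a self-supervised synchronization classifier over two views.
import torch
import torch.nn as nn

class SyncClassifierSketch(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(             # shared across views
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.head = nn.Linear(2 * feat_dim, 2)    # synced vs. unsynced

    def forward(self, view_a, view_b):
        f = torch.cat([self.encoder(view_a), self.encoder(view_b)], dim=1)
        return self.head(f)

# Self-supervised labels come for free: same timestamp -> 1, shifted -> 0.
model = SyncClassifierSketch()
logits = model(torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))
```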
arXiv Detail & Related papers (2020-10-13T08:01:24Z)
- MM-Hand: 3D-Aware Multi-Modal Guided Hand Generative Network for 3D Hand Pose Synthesis [81.40640219844197]
Estimating the 3D hand pose from a monocular RGB image is important but challenging.
A solution is training on large-scale RGB hand images with accurate 3D hand keypoint annotations.
We have developed a learning-based approach to synthesize realistic, diverse, and 3D pose-preserving hand images.
arXiv Detail & Related papers (2020-10-02T18:27:34Z)
- Monocular Real-time Hand Shape and Motion Capture using Multi-modal Data [77.34069717612493]
We present a novel method for monocular hand shape and pose estimation at unprecedented runtime performance of 100fps.
This is enabled by a new learning-based architecture designed so that it can make use of all available sources of hand training data.
It features a 3D hand joint detection module and an inverse kinematics module, which not only regresses 3D joint positions but also maps them to joint rotations in a single feed-forward pass.
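A minimal sketch of that two-module design, under assumed layer sizes and an assumed axis-angle rotation output (neither is from the paper): a detection stage predicts 3D keypoints from the image, and a learned inverse-kinematics stage maps them to per-joint rotations within the same forward pass.

```python
# Illustrative two-stage hand pose sketch: joint detection then learned IK.
import torch
import torch.nn as nn

class HandPoseIKSketch(nn.Module):
    def __init__(self, num_joints=21):
        super().__init__()
        self.detector = nn.Sequential(            # image features -> 3D joints
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_joints * 3),
        )
        self.ik = nn.Sequential(                  # joints -> joint rotations
            nn.Linear(num_joints * 3, 256), nn.ReLU(),
            nn.Linear(256, num_joints * 3),       # assumed axis-angle per joint
        )
        self.num_joints = num_joints

    def forward(self, img):
        joints = self.detector(img)               # (B, J*3) joint positions
        rotations = self.ik(joints)               # same feed-forward pass
        return (joints.view(-1, self.num_joints, 3),
                rotations.view(-1, self.num_joints, 3))

joints, rots = HandPoseIKSketch()(torch.randn(2, 3, 128, 128))
```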
arXiv Detail & Related papers (2020-03-21T03:51:54Z)