Multi-View Video-Based 3D Hand Pose Estimation
- URL: http://arxiv.org/abs/2109.11747v1
- Date: Fri, 24 Sep 2021 05:20:41 GMT
- Title: Multi-View Video-Based 3D Hand Pose Estimation
- Authors: Leyla Khaleghi, Alireza Sepas Moghaddam, Joshua Marshall, Ali Etemad
- Abstract summary: We present the Multi-View Video-Based 3D Hand dataset, consisting of multi-view videos of the hand along with ground-truth 3D pose labels.
Our dataset includes more than 402,000 synthetic hand images available in 4,560 videos.
Next, we implement MuViHandNet, a neural pipeline consisting of image encoders for obtaining visual embeddings of the hand.
- Score: 11.65577683784217
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Hand pose estimation (HPE) can be used for a variety of human-computer
interaction applications such as gesture-based control for physical or
virtual/augmented reality devices. Recent works have shown that videos or
multi-view images carry rich information regarding the hand, allowing for the
development of more robust HPE systems. In this paper, we present the
Multi-View Video-Based 3D Hand (MuViHand) dataset, consisting of multi-view
videos of the hand along with ground-truth 3D pose labels. Our dataset includes
more than 402,000 synthetic hand images available in 4,560 videos. The videos
have been simultaneously captured from six different angles with complex
backgrounds and random levels of dynamic lighting. The data has been captured
from 10 distinct animated subjects using 12 cameras in a semi-circle topology
where six tracking cameras only focus on the hand and the other six fixed
cameras capture the entire body. Next, we implement MuViHandNet, a neural
pipeline consisting of image encoders for obtaining visual embeddings of the
hand, recurrent learners to learn both temporal and angular sequential
information, and graph networks with U-Net architectures to estimate the final
3D pose information. We perform extensive experiments and show the challenging
nature of this new dataset as well as the effectiveness of our proposed method.
Ablation studies show the added value of each component in MuViHandNet, as well
as the benefit of having temporal and sequential information in the dataset.
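The described pipeline (per-frame image encoders, recurrent learners over frames and viewing angles, then a pose regressor) maps onto a small multi-view, multi-frame model. Below is a minimal PyTorch-style sketch of such a MuViHandNet-like pipeline; the module names, dimensions, and the linear pose head standing in for the paper's graph U-Net are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a MuViHandNet-style pipeline (hypothetical module names).
# Input: hand video clips from several views, shaped (batch, views, frames, 3, H, W);
# output: 21 3D hand joints per clip.
import torch
import torch.nn as nn
import torchvision.models as models

class MuViHandNetSketch(nn.Module):
    def __init__(self, num_joints=21, embed_dim=512, hidden_dim=256):
        super().__init__()
        # Image encoder: a ResNet backbone yields one visual embedding per frame and view.
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.encoder = backbone
        # Recurrent learners: one GRU over time (frames), one over viewing angles.
        self.temporal_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.angular_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Stand-in for the paper's graph U-Net: a linear head regressing 3D joints.
        self.pose_head = nn.Linear(hidden_dim, num_joints * 3)
        self.num_joints = num_joints

    def forward(self, clips):
        b, v, t, c, h, w = clips.shape
        feats = self.encoder(clips.reshape(b * v * t, c, h, w))     # per-image embeddings
        feats = feats.reshape(b * v, t, -1)
        temporal, _ = self.temporal_rnn(feats)                      # sequence over frames
        per_view = temporal[:, -1].reshape(b, v, -1)
        angular, _ = self.angular_rnn(per_view)                     # sequence over views
        fused = angular[:, -1]
        return self.pose_head(fused).reshape(b, self.num_joints, 3)

# Usage: six views, eight frames of 128x128 RGB hand crops.
model = MuViHandNetSketch()
dummy = torch.randn(2, 6, 8, 3, 128, 128)
print(model(dummy).shape)  # torch.Size([2, 21, 3])
```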
Related papers
- VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation [62.64811405314847]
We introduce VidCRAFT3, a novel framework for precise image-to-video generation.
It enables control over camera motion, object motion, and lighting direction simultaneously.
Experiments on benchmark datasets demonstrate the efficacy of VidCRAFT3 in producing high-quality video content.
arXiv Detail & Related papers (2025-02-11T13:11:59Z)
- HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos [9.513100627302755]
We introduce HOT3D, a dataset for egocentric hand and object tracking in 3D.
The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects.
In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, 6DoF object pose estimation, and 3D lifting of unknown in-hand objects.
arXiv Detail & Related papers (2024-11-28T14:09:42Z)
- PIV3CAMS: a multi-camera dataset for multiple computer vision problems and its application to novel view-point synthesis [120.4361056355332]
This thesis introduces Paired Image and Video data from three CAMeraS, namely PIV3CAMS.
The PIV3CAMS dataset consists of 8385 pairs of images and 82 pairs of videos taken from three different cameras.
In addition to the regeneration of a current state-of-the-art algorithm, we investigate several proposed alternative models that integrate depth information geometrically.
arXiv Detail & Related papers (2024-07-26T12:18:29Z)
- HUP-3D: A 3D multi-view synthetic dataset for assisted-egocentric hand-ultrasound pose estimation [11.876066932162873]
HUP-3D is a 3D multiview synthetic dataset for hand-ultrasound probe pose estimation.
Our dataset consists of over 31k sets of movements.
Our approach includes an image rendering concept that enhances diversity with various hand and arm textures.
arXiv Detail & Related papers (2024-07-12T12:25:42Z)
- HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction [16.363878619678367]
We introduce a data capture system and a new dataset, HO-Cap, for 3D reconstruction and pose tracking of hands and objects in videos.
The system leverages multiple RGB-D cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems.
We propose a semi-automatic method for annotating the shape and pose of hands and objects in the collected videos, significantly reducing the annotation time compared to manual labeling.
arXiv Detail & Related papers (2024-06-10T23:25:19Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Learning to Deblur and Rotate Motion-Blurred Faces [43.673660541417995]
We train a neural network to reconstruct a 3D video representation from a single image and the corresponding face gaze.
We then provide a camera viewpoint relative to the estimated gaze and the blurry image as input to an encoder-decoder network to generate a video of sharp frames with a novel camera viewpoint.
arXiv Detail & Related papers (2021-12-14T17:51:19Z)
- 4D-Net for Learned Multi-Modal Alignment [87.58354992455891]
We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time.
We are able to incorporate the 4D information by performing a novel connection learning across various feature representations and levels of abstraction, as well as by observing geometric constraints.
arXiv Detail & Related papers (2021-09-02T16:35:00Z)
- D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z)
- MM-Hand: 3D-Aware Multi-Modal Guided Hand Generative Network for 3D Hand Pose Synthesis [81.40640219844197]
Estimating the 3D hand pose from a monocular RGB image is important but challenging.
A solution is training on large-scale RGB hand images with accurate 3D hand keypoint annotations.
We have developed a learning-based approach to synthesize realistic, diverse, and 3D pose-preserving hand images.
arXiv Detail & Related papers (2020-10-02T18:27:34Z)