Multi-View Video-Based 3D Hand Pose Estimation
- URL: http://arxiv.org/abs/2109.11747v1
- Date: Fri, 24 Sep 2021 05:20:41 GMT
- Title: Multi-View Video-Based 3D Hand Pose Estimation
- Authors: Leyla Khaleghi, Alireza Sepas Moghaddam, Joshua Marshall, Ali Etemad
- Abstract summary: We present the Multi-View Video-Based 3D Hand dataset, consisting of multi-view videos of the hand along with ground-truth 3D pose labels.
Our dataset includes more than 402,000 synthetic hand images available in 4,560 videos.
Next, we implement MuViHandNet, a neural pipeline consisting of image encoders for obtaining visual embeddings of the hand.
- Score: 11.65577683784217
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Hand pose estimation (HPE) can be used for a variety of human-computer
interaction applications such as gesture-based control for physical or
virtual/augmented reality devices. Recent works have shown that videos or
multi-view images carry rich information regarding the hand, allowing for the
development of more robust HPE systems. In this paper, we present the
Multi-View Video-Based 3D Hand (MuViHand) dataset, consisting of multi-view
videos of the hand along with ground-truth 3D pose labels. Our dataset includes
more than 402,000 synthetic hand images available in 4,560 videos. The videos
have been simultaneously captured from six different angles with complex
backgrounds and random levels of dynamic lighting. The data has been captured
from 10 distinct animated subjects using 12 cameras in a semi-circle topology
where six tracking cameras only focus on the hand and the other six fixed
cameras capture the entire body. Next, we implement MuViHandNet, a neural
pipeline consisting of image encoders for obtaining visual embeddings of the
hand, recurrent learners to learn both temporal and angular sequential
information, and graph networks with U-Net architectures to estimate the final
3D pose information. We perform extensive experiments and show the challenging
nature of this new dataset as well as the effectiveness of our proposed method.
Ablation studies show the added value of each component in MuViHandNet, as well
as the benefit of having temporal and sequential information in the dataset.
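The described pipeline (per-frame image encoders, recurrent learners over frames and viewing angles, then a pose regressor) maps onto a small multi-view, multi-frame model. Below is a minimal PyTorch-style sketch of such a MuViHandNet-like pipeline; the module names, dimensions, and the linear pose head standing in for the paper's graph U-Net are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a MuViHandNet-style pipeline (hypothetical module names).
# Input: hand video clips from several views, shaped (batch, views, frames, 3, H, W);
# output: 21 3D hand joints per clip.
import torch
import torch.nn as nn
import torchvision.models as models

class MuViHandNetSketch(nn.Module):
    def __init__(self, num_joints=21, embed_dim=512, hidden_dim=256):
        super().__init__()
        # Image encoder: a ResNet backbone yields one visual embedding per frame and view.
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.encoder = backbone
        # Recurrent learners: one GRU over time (frames), one over viewing angles.
        self.temporal_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.angular_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Stand-in for the paper's graph U-Net: a linear head regressing 3D joints.
        self.pose_head = nn.Linear(hidden_dim, num_joints * 3)
        self.num_joints = num_joints

    def forward(self, clips):
        b, v, t, c, h, w = clips.shape
        feats = self.encoder(clips.reshape(b * v * t, c, h, w))     # per-image embeddings
        feats = feats.reshape(b * v, t, -1)
        temporal, _ = self.temporal_rnn(feats)                      # sequence over frames
        per_view = temporal[:, -1].reshape(b, v, -1)
        angular, _ = self.angular_rnn(per_view)                     # sequence over views
        fused = angular[:, -1]
        return self.pose_head(fused).reshape(b, self.num_joints, 3)

# Usage: six views, eight frames of 128x128 RGB hand crops.
model = MuViHandNetSketch()
dummy = torch.randn(2, 6, 8, 3, 128, 128)
print(model(dummy).shape)  # torch.Size([2, 21, 3])
```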
Related papers
- VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation [62.64811405314847]
We introduce VidCRAFT3, a novel framework for precise image-to-video generation.
It enables control over camera motion, object motion, and lighting direction simultaneously.
Experiments on benchmark datasets demonstrate the efficacy of VidCRAFT3 in producing high-quality video content.
arXiv Detail & Related papers (2025-02-11T13:11:59Z)
- HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos [9.513100627302755]
We introduce HOT3D, a dataset for egocentric hand and object tracking in 3D.
The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects.
In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, 6DoF object pose estimation, and 3D lifting of unknown in-hand objects.
arXiv Detail & Related papers (2024-11-28T14:09:42Z)
- PIV3CAMS: a multi-camera dataset for multiple computer vision problems and its application to novel view-point synthesis [120.4361056355332]
This thesis introduces Paired Image and Video data from three CAMeraS, namely PIV3CAMS.
The PIV3CAMS dataset consists of 8385 pairs of images and 82 pairs of videos taken from three different cameras.
In addition to the regeneration of a current state-of-the-art algorithm, we investigate several proposed alternative models that integrate depth information geometrically.
arXiv Detail & Related papers (2024-07-26T12:18:29Z)
- HUP-3D: A 3D multi-view synthetic dataset for assisted-egocentric hand-ultrasound pose estimation [11.876066932162873]
HUP-3D is a 3D multiview synthetic dataset for hand-ultrasound probe pose estimation.
Our dataset consists of over 31k sets of movements.
Our approach includes an image rendering concept that enhances diversity with various hand and arm textures.
arXiv Detail & Related papers (2024-07-12T12:25:42Z)
- HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction [16.363878619678367]
We introduce a data capture system and a new dataset, HO-Cap, for 3D reconstruction and pose tracking of hands and objects in videos.
The system leverages multiple RGB-D cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems.
We propose a semi-automatic method for annotating the shape and pose of hands and objects in the collected videos, significantly reducing the annotation time compared to manual labeling.
arXiv Detail & Related papers (2024-06-10T23:25:19Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Learning to Deblur and Rotate Motion-Blurred Faces [43.673660541417995]
We train a neural network to reconstruct a 3D video representation from a single image and the corresponding face gaze.
We then provide a camera viewpoint relative to the estimated gaze and the blurry image as input to an encoder-decoder network to generate a video of sharp frames with a novel camera viewpoint.
arXiv Detail & Related papers (2021-12-14T17:51:19Z)
- 4D-Net for Learned Multi-Modal Alignment [87.58354992455891]
We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time.
We are able to incorporate the 4D information by performing a novel connection learning across various feature representations and levels of abstraction, as well as by observing geometric constraints.
arXiv Detail & Related papers (2021-09-02T16:35:00Z)
- D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z)
- MM-Hand: 3D-Aware Multi-Modal Guided Hand Generative Network for 3D Hand Pose Synthesis [81.40640219844197]
Estimating the 3D hand pose from a monocular RGB image is important but challenging.
A solution is training on large-scale RGB hand images with accurate 3D hand keypoint annotations.
We have developed a learning-based approach to synthesize realistic, diverse, and 3D pose-preserving hand images.
arXiv Detail & Related papers (2020-10-02T18:27:34Z)