UmeTrack: Unified multi-view end-to-end hand tracking for VR
- URL: http://arxiv.org/abs/2211.00099v1
- Date: Mon, 31 Oct 2022 19:09:21 GMT
- Title: UmeTrack: Unified multi-view end-to-end hand tracking for VR
- Authors: Shangchen Han, Po-chen Wu, Yubo Zhang, Beibei Liu, Linguang Zhang,
Zheng Wang, Weiguang Si, Peizhao Zhang, Yujun Cai, Tomas Hodan, Randi
Cabezas, Luan Tran, Muzaffer Akbay, Tsz-Ho Yu, Cem Keskin, Robert Wang
- Abstract summary: Real-time tracking of 3D hand pose in world space is a challenging problem and plays an important role in VR interaction.
We present a unified end-to-end differentiable framework for multi-view, multi-frame hand tracking that directly predicts 3D hand pose in world space.
- Score: 34.352638006495326
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time tracking of 3D hand pose in world space is a challenging problem
and plays an important role in VR interaction. Existing work in this space is
limited to either producing root-relative (rather than world-space) 3D pose or
relying on multiple stages, such as heatmap generation and kinematic
optimization, to obtain the 3D pose. Moreover, the typical VR scenario, which
involves multi-view tracking with wide field-of-view (FOV) cameras, is seldom
addressed by these methods. In
this paper, we present a unified end-to-end differentiable framework for
multi-view, multi-frame hand tracking that directly predicts 3D hand pose in
world space. We demonstrate the benefits of end-to-end differentiability by
extending our framework with downstream tasks such as jitter reduction and
pinch prediction. To demonstrate the efficacy of our model, we further present
a new large-scale egocentric hand pose dataset that consists of both real and
synthetic data. Experiments show that our system trained on this dataset
handles various challenging interactive motions, and has been successfully
applied to real-time VR applications.
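As a rough illustration of what directly predicting world-space 3D hand pose from multiple views can look like, here is a minimal PyTorch sketch: a shared per-view encoder, camera-pose conditioning, simple cross-view fusion, and a world-space keypoint head. The layer sizes, the 21-keypoint layout, and the average-based fusion are assumptions for illustration, not UmeTrack's actual architecture.

```python
# Minimal sketch of direct multi-view, world-space 3D hand pose regression.
# Illustrative only: layer sizes, the 21-keypoint layout, and the fusion-by-
# average design are assumptions, not UmeTrack's actual architecture.
import torch
import torch.nn as nn

NUM_KEYPOINTS = 21  # assumed hand skeleton size

class MultiViewHandNet(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # Shared per-view image encoder (tiny CNN stand-in).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Each view's camera-to-world pose (flattened 3x4) conditions its
        # feature, so the head can reason in a common world frame.
        self.pose_embed = nn.Linear(12, feat_dim)
        self.head = nn.Linear(feat_dim, NUM_KEYPOINTS * 3)

    def forward(self, images, cam_to_world):
        # images: (B, V, 1, H, W); cam_to_world: (B, V, 3, 4)
        B, V = images.shape[:2]
        feats = self.encoder(images.flatten(0, 1)).view(B, V, -1)
        feats = feats + self.pose_embed(cam_to_world.flatten(2))
        fused = feats.mean(dim=1)  # simple cross-view fusion
        return self.head(fused).view(B, NUM_KEYPOINTS, 3)  # world-space pose

net = MultiViewHandNet()
kp = net(torch.randn(2, 4, 1, 96, 96), torch.randn(2, 4, 3, 4))
print(kp.shape)  # torch.Size([2, 21, 3])
```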
Related papers
- WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild [53.288327629960364]
We present a data-driven pipeline for efficient multi-hand reconstruction in the wild.
The proposed pipeline is composed of two components: a real-time fully convolutional hand localization network and a high-fidelity transformer-based 3D hand reconstruction model.
Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks.
arXiv Detail & Related papers (2024-09-18T18:46:51Z)
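The two-component design described in the WiLoR entry above, a real-time localizer feeding a transformer-based reconstructor, could be wired together roughly as follows. Both stages here are hypothetical stand-ins, not WiLoR's actual models.

```python
# Toy sketch of a detect-then-reconstruct hand pipeline like the one the WiLoR
# summary describes. Both stages are hypothetical stand-ins, not WiLoR's models.
import torch
import torch.nn as nn

class HandDetector(nn.Module):
    """Stage 1: fully convolutional localization -> per-hand boxes (stub)."""
    def forward(self, image):
        # Real detectors regress boxes from feature maps; here we fake two.
        return [torch.tensor([10., 10., 60., 60.]),
                torch.tensor([80., 40., 130., 90.])]  # (x1, y1, x2, y2)

class HandReconstructor(nn.Module):
    """Stage 2: transformer-based 3D reconstruction from a hand crop."""
    def __init__(self, dim=64, n_tokens=21):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim))  # one per joint
        self.patch = nn.Conv2d(3, dim, 16, stride=16)           # patch embed
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_xyz = nn.Linear(dim, 3)

    def forward(self, crop):  # crop: (1, 3, 64, 64)
        patches = self.patch(crop).flatten(2).transpose(1, 2)   # (1, 16, dim)
        seq = torch.cat([self.tokens.unsqueeze(0), patches], dim=1)
        out = self.encoder(seq)[:, :self.tokens.shape[0]]       # joint tokens
        return self.to_xyz(out)                                 # (1, 21, 3)

detector, reconstructor = HandDetector(), HandReconstructor()
image = torch.rand(1, 3, 160, 160)
for x1, y1, x2, y2 in [b.int().tolist() for b in detector(image)]:
    crop = torch.nn.functional.interpolate(
        image[:, :, y1:y2, x1:x2], size=(64, 64), mode="bilinear")
    print(reconstructor(crop).shape)  # torch.Size([1, 21, 3])
```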
- Self-Avatar Animation in Virtual Reality: Impact of Motion Signals Artifacts on the Full-Body Pose Reconstruction [13.422686350235615]
We aim to measure the impact of motion signal artifacts on the reconstruction of the articulated self-avatar's full-body pose.
We analyze the motion reconstruction errors using ground truth and 3D Cartesian coordinates estimated from YOLOv8 pose estimation.
arXiv Detail & Related papers (2024-04-29T12:02:06Z)
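A common way to quantify such motion reconstruction errors is the mean per-joint position error (MPJPE) between reconstructed and ground-truth joint positions. The sketch below shows this standard metric; the paper's exact evaluation protocol may differ.

```python
# Sketch of a standard motion-reconstruction error: mean per-joint position
# error (MPJPE) between a reconstructed pose sequence and ground truth.
# A common metric for this kind of study, not necessarily the paper's exact one.
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: (frames, joints, 3) 3D Cartesian coordinates, same units."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

rng = np.random.default_rng(0)
gt = rng.normal(size=(100, 17, 3))                 # 100 frames, 17 joints (assumed layout)
pred = gt + rng.normal(scale=0.02, size=gt.shape)  # reconstruction with small noise
print(f"MPJPE: {mpjpe(pred, gt):.4f}")             # average joint error over the sequence
```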
- MOVIN: Real-time Motion Capture using a Single LiDAR [7.3228874258537875]
We present MOVIN, a data-driven generative method for real-time motion capture with global tracking.
Our framework accurately predicts the performer's 3D global information and local joint details.
We implement a real-time application to showcase our method in real-world scenarios.
arXiv Detail & Related papers (2023-09-17T16:04:15Z)
- Multi-Modal Dataset Acquisition for Photometrically Challenging Object [56.30027922063559]
This paper addresses the limitations of current datasets for 3D vision tasks in terms of accuracy, size, realism, and suitable imaging modalities for photometrically challenging objects.
We propose a novel annotation and acquisition pipeline that enhances existing 3D perception and 6D object pose datasets.
arXiv Detail & Related papers (2023-08-21T10:38:32Z)
- Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling [13.284947022380404]
We propose a two-stage framework that can obtain accurate and smooth full-body motions with three tracking signals of head and hands only.
Our framework explicitly models the joint-level features in the first stage and utilizes them as temporal tokens in alternating spatial and temporal transformer blocks to capture joint-level correlations in the second stage.
With extensive experiments on the AMASS motion dataset and real-captured data, we show our proposed method can achieve more accurate and smooth motion compared to existing approaches.
arXiv Detail & Related papers (2023-08-17T08:27:55Z)
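The alternating spatial/temporal transformer pattern over joint-level tokens mentioned in the entry above can be sketched as follows. Dimensions, block counts, and the 22-joint layout are illustrative guesses, not the paper's configuration.

```python
# Rough sketch of alternating spatial/temporal attention over joint-level
# tokens, the pattern the full-body tracking summary above describes.
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, joints, dim)
        B, T, J, D = x.shape
        # Joints attend to each other within a frame...
        x = self.spatial(x.reshape(B * T, J, D)).reshape(B, T, J, D)
        # ...then each joint attends over time.
        x = x.transpose(1, 2).reshape(B * J, T, D)
        x = self.temporal(x).reshape(B, J, T, D).transpose(1, 2)
        return x

blocks = nn.Sequential(*[SpatioTemporalBlock() for _ in range(2)])
tokens = torch.randn(2, 30, 22, 64)  # 30 frames, 22 body joints (assumed)
print(blocks(tokens).shape)          # torch.Size([2, 30, 22, 64])
```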
- Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting [79.34357055254239]
Hand trajectory forecasting is crucial for enabling a prompt understanding of human intentions when interacting with AR/VR systems.
Existing methods handle this problem in a 2D image space, which is inadequate for 3D real-world applications.
We set up an egocentric 3D hand trajectory forecasting task that aims to predict hand trajectories in a 3D space from early observed RGB videos in a first-person view.
arXiv Detail & Related papers (2023-07-17T04:55:02Z)
- RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video [76.86512780916827]
We present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera.
In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN.
We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline.
arXiv Detail & Related papers (2021-06-22T12:53:56Z)
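The multi-task idea from the RGB2Hands entry above, one shared backbone with several prediction heads to help resolve depth ambiguity, can be sketched like this. The specific heads chosen (2D heatmaps, relative depth, segmentation) are assumptions rather than RGB2Hands' exact outputs.

```python
# Sketch of a multi-task hand CNN: one shared backbone, several heads.
# Head choices and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskHandCNN(nn.Module):
    def __init__(self, n_joints=21):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.kp2d_head = nn.Conv2d(64, n_joints, 1)   # per-joint 2D heatmaps
        self.depth_head = nn.Conv2d(64, n_joints, 1)  # per-joint relative depth
        self.seg_head = nn.Conv2d(64, 2, 1)           # left/right hand masks

    def forward(self, rgb):
        f = self.backbone(rgb)
        return self.kp2d_head(f), self.depth_head(f), self.seg_head(f)

model = MultiTaskHandCNN()
heatmaps, depths, masks = model(torch.rand(1, 3, 128, 128))
print(heatmaps.shape, depths.shape, masks.shape)
# torch.Size([1, 21, 32, 32]) torch.Size([1, 21, 32, 32]) torch.Size([1, 2, 32, 32])
```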
- Monocular Quasi-Dense 3D Object Tracking [99.51683944057191]
A reliable and accurate 3D tracking framework is essential for predicting future locations of surrounding objects and planning the observer's actions in numerous applications such as autonomous driving.
We propose a framework that can effectively associate moving objects over time and estimate their full 3D bounding box information from a sequence of 2D images captured on a moving platform.
arXiv Detail & Related papers (2021-03-12T15:30:02Z)
- SelfPose: 3D Egocentric Pose Estimation from a Headset Mounted Camera [97.0162841635425]
We present a solution to egocentric 3D body pose estimation from monocular images captured by downward-looking fisheye cameras installed on the rim of a head-mounted VR device.
This unusual viewpoint leads to images with unique visual appearance, with severe self-occlusions and perspective distortions.
We propose an encoder-decoder architecture with a novel multi-branch decoder designed to account for the varying uncertainty in 2D predictions.
arXiv Detail & Related papers (2020-11-02T16:18:06Z)
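A multi-branch decoder that also outputs per-joint uncertainty, as the SelfPose entry above describes, might look roughly like this. The branch structure and sizes are illustrative assumptions.

```python
# Sketch of an encoder-decoder with a multi-branch decoder: one branch for 2D
# joint heatmaps and one for per-joint uncertainty. Illustrative assumptions.
import torch
import torch.nn as nn

class MultiBranchPoseNet(nn.Module):
    def __init__(self, n_joints=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Branch 1: upsampling decoder producing 2D heatmaps.
        self.heatmap_branch = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, n_joints, 1),
        )
        # Branch 2: per-joint log-variance, so confident and ambiguous 2D
        # detections can be weighted differently downstream.
        self.uncertainty_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_joints),
        )

    def forward(self, fisheye_img):
        f = self.encoder(fisheye_img)
        return self.heatmap_branch(f), self.uncertainty_branch(f)

net = MultiBranchPoseNet()
heatmaps, log_var = net(torch.rand(1, 3, 128, 128))
print(heatmaps.shape, log_var.shape)  # torch.Size([1, 16, 64, 64]) torch.Size([1, 16])
```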
- Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution [34.301501457959056]
We propose a temporal regression network with a gated convolution module to transform 2D joints to 3D.
A simple yet effective localization approach is also conducted to transform the normalized pose to the global trajectory.
Our proposed method outperforms most state-of-the-art 2D-to-3D pose estimation methods.
arXiv Detail & Related papers (2020-10-31T04:35:24Z)
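The gated convolution module for lifting 2D joint sequences to 3D, mentioned in the entry above, can be sketched with the common tanh/sigmoid gating pattern. This illustrates the layer type, not the paper's exact network.

```python
# Minimal sketch of a gated temporal convolution lifting 2D joint sequences
# to 3D. The tanh/sigmoid gate design is the common pattern for gated convs;
# it is not the paper's exact layer.
import torch
import torch.nn as nn

class GatedTemporalConv(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.feature = nn.Conv1d(c_in, c_out, k, padding=k // 2)
        self.gate = nn.Conv1d(c_in, c_out, k, padding=k // 2)

    def forward(self, x):  # x: (batch, channels, time)
        # The sigmoid gate decides, per channel and frame, how much of the
        # tanh feature response passes through (useful under occlusion).
        return torch.tanh(self.feature(x)) * torch.sigmoid(self.gate(x))

n_joints = 17
lifter = nn.Sequential(
    GatedTemporalConv(n_joints * 2, 128),  # input: 2D (x, y) per joint, per frame
    GatedTemporalConv(128, 128),
    nn.Conv1d(128, n_joints * 3, 1),       # output: 3D (x, y, z) per joint
)
pose2d = torch.randn(1, n_joints * 2, 50)  # 50-frame sequence of 2D joints
print(lifter(pose2d).shape)                # torch.Size([1, 51, 50])
```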