Ego-Exo 3D Hand Tracking in the Wild with a Mobile Multi-Camera Rig
- URL: http://arxiv.org/abs/2510.02601v1
- Date: Thu, 02 Oct 2025 22:26:03 GMT
- Title: Ego-Exo 3D Hand Tracking in the Wild with a Mobile Multi-Camera Rig
- Authors: Patrick Rim, Kun He, Kevin Harris, Braden Copple, Shangchen Han, Sizhe An, Ivan Shugurov, Tomas Hodan, He Wen, Xu Xie,
- Abstract summary: We introduce a novel marker-less multi-camera system designed to capture precise 3D hands and objects. We combine a lightweight, back-mounted capture rig with eight exocentric cameras, and a user-worn Meta Quest 3 headset, which contributes two egocentric views. We collect an annotated dataset featuring synchronized multi-view images and precise 3D hand poses.
- Score: 14.496137517475743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate 3D tracking of hands and their interactions with the world in unconstrained settings remains a significant challenge for egocentric computer vision. With few exceptions, existing datasets are predominantly captured in controlled lab setups, limiting environmental diversity and model generalization. To address this, we introduce a novel marker-less multi-camera system designed to capture precise 3D hands and objects, which allows for nearly unconstrained mobility in genuinely in-the-wild conditions. We combine a lightweight, back-mounted capture rig with eight exocentric cameras, and a user-worn Meta Quest 3 headset, which contributes two egocentric views. We design an ego-exo tracking pipeline to generate accurate 3D hand pose ground truth from this system, and rigorously evaluate its quality. By collecting an annotated dataset featuring synchronized multi-view images and precise 3D hand poses, we demonstrate the capability of our approach to significantly reduce the trade-off between environmental realism and 3D annotation accuracy.
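The abstract describes ten synchronized views (eight exocentric, two egocentric) feeding a pipeline that produces 3D hand pose ground truth, but it does not spell out the pipeline itself. As a rough illustration of why many calibrated views help, the sketch below triangulates a single hand joint from its 2D observations using the standard linear (DLT) method. This is a generic textbook technique and not necessarily what the authors use. The camera model, intrinsics, camera positions, and the helper names `make_camera`, `project`, and `triangulate_dlt` are all hypothetical.

```python
import numpy as np


def triangulate_dlt(projections, pixels):
    """Linear (DLT) triangulation of one 3D point from N >= 2 calibrated views.

    projections: iterable of 3x4 camera projection matrices.
    pixels: iterable of (u, v) observations, one per camera.
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


def make_camera(center, focal=500.0, cx=320.0, cy=240.0):
    """Hypothetical pinhole camera with identity rotation, located at `center`."""
    K = np.array([[focal, 0.0, cx], [0.0, focal, cy], [0.0, 0.0, 1.0]])
    Rt = np.hstack([np.eye(3), -np.asarray(center, dtype=float).reshape(3, 1)])
    return K @ Rt


def project(P, X):
    """Project a 3D point into pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


# Ten synthetic cameras standing in for 8 exocentric + 2 egocentric views (assumed layout).
rng = np.random.default_rng(0)
centers = rng.uniform(-0.3, 0.3, size=(10, 3))
centers[:, 2] = rng.uniform(-0.1, 0.1, size=10)  # keep every camera behind the joint
cameras = [make_camera(c) for c in centers]

joint = np.array([0.05, -0.02, 0.6])  # made-up hand joint position in metres
pixels = [project(P, joint) for P in cameras]
estimate = triangulate_dlt(cameras, pixels)
```

With noise-free observations the estimate matches the input joint up to numerical precision. With real detections, each additional view adds two more constraints to the least-squares system, which is one reason a ten-camera rig can tolerate occlusion or detector error in individual views.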
Related papers
- SpatialTrackerV2: 3D Point Tracking Made Easy [73.0350898700048]
SpatialTrackerV2 is a feed-forward 3D point tracking method for monocular videos. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion. By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%.
arXiv Detail & Related papers (2025-07-16T17:59:03Z)
- FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video [52.33896173943054]
Egocentric motion capture with a head-mounted body-facing stereo camera is crucial for VR and AR applications. Existing methods rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings. We propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction.
arXiv Detail & Related papers (2025-03-29T14:26:06Z)
- HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos [9.513100627302755]
The dataset offers over 833 minutes (3.7M+ images) of recordings that feature 19 subjects interacting with 33 diverse rigid objects. The recordings include multiple synchronized data streams containing egocentric multi-view RGB/monochrome images, eye gaze signal, scene point clouds, and 3D poses of cameras, hands, and objects. In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, model-based 6DoF object pose estimation, and 3D lifting of unknown in-hand objects.
arXiv Detail & Related papers (2024-11-28T14:09:42Z)
- Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement [65.08165593201437]
We explore egocentric whole-body motion capture using a single fisheye camera, which simultaneously estimates human body and hand motion.
This task presents significant challenges due to the lack of high-quality datasets, fisheye camera distortion, and human body self-occlusion.
We propose a novel approach that leverages FisheyeViT to extract fisheye image features, which are converted into pixel-aligned 3D heatmap representations for 3D human body pose prediction.
arXiv Detail & Related papers (2023-11-28T07:13:47Z)
- Multi-Modal Dataset Acquisition for Photometrically Challenging Object [56.30027922063559]
This paper addresses the limitations of current datasets for 3D vision tasks in terms of accuracy, size, realism, and suitable imaging modalities for photometrically challenging objects.
We propose a novel annotation and acquisition pipeline that enhances existing 3D perception and 6D object pose datasets.
arXiv Detail & Related papers (2023-08-21T10:38:32Z)
- UmeTrack: Unified multi-view end-to-end hand tracking for VR [34.352638006495326]
Real-time tracking of 3D hand pose in world space is a challenging problem and plays an important role in VR interaction.
We present a unified end-to-end differentiable framework for multi-view, multi-frame hand tracking that directly predicts 3D hand pose in world space.
arXiv Detail & Related papers (2022-10-31T19:09:21Z)
- MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision [72.5863451123577]
We show how to train a neural model that can perform accurate 3D pose and camera estimation.
Our method outperforms both classical bundle adjustment and weakly-supervised monocular 3D baselines.
arXiv Detail & Related papers (2021-08-10T18:39:56Z)
- SelfPose: 3D Egocentric Pose Estimation from a Headset Mounted Camera [97.0162841635425]
We present a solution to egocentric 3D body pose estimation from monocular images captured by downward-looking fish-eye cameras installed on the rim of a head-mounted VR device.
This unusual viewpoint leads to images with unique visual appearance, with severe self-occlusions and perspective distortions.
We propose an encoder-decoder architecture with a novel multi-branch decoder designed to account for the varying uncertainty in 2D predictions.
arXiv Detail & Related papers (2020-11-02T16:18:06Z)
- Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution [34.301501457959056]
We propose a temporal regression network with a gated convolution module to transform 2D joints to 3D.
A simple yet effective localization approach is also conducted to transform the normalized pose to the global trajectory.
Our proposed method outperforms most state-of-the-art 2D-to-3D pose estimation methods.
arXiv Detail & Related papers (2020-10-31T04:35:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.