WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos
- URL: http://arxiv.org/abs/2602.22209v1
- Date: Wed, 25 Feb 2026 18:59:10 GMT
- Title: WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos
- Authors: Yufei Ye, Jiaman Li, Ryan Rong, C. Karen Liu
- Abstract summary: WHOLE is a method that holistically reconstructs hand and object motion in world space from egocentric videos. We learn a generative prior over hand-object motion to jointly reason about their interactions. This joint generative reconstruction substantially outperforms approaches that process hands and objects separately followed by post-processing.
- Score: 21.692312457166704
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Egocentric manipulation videos are highly challenging due to severe occlusions during interactions and frequent object entries and exits from the camera view as the person moves. Current methods typically focus on recovering either hand or object pose in isolation, but both struggle during interactions and fail to handle out-of-sight cases. Moreover, their independent predictions often lead to inconsistent hand-object relations. We introduce WHOLE, a method that holistically reconstructs hand and object motion in world space from egocentric videos given object templates. Our key insight is to learn a generative prior over hand-object motion to jointly reason about their interactions. At test time, the pretrained prior is guided to generate trajectories that conform to the video observations. This joint generative reconstruction substantially outperforms approaches that process hands and objects separately followed by post-processing. WHOLE achieves state-of-the-art performance on hand motion estimation, 6D object pose estimation, and their relative interaction reconstruction. Project website: https://judyye.github.io/whole-www
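The key mechanism described above, guiding a pretrained generative prior at test time so that sampled trajectories conform to video observations, follows the general pattern of guidance-based diffusion sampling. Below is a minimal PyTorch sketch of that pattern under stated assumptions: the `prior` object with `p_mean_variance` and `predict_x0` methods, the `obs_loss_fn`, and the trajectory shape are all illustrative placeholders, not WHOLE's actual interfaces.

```python
# Minimal sketch of guidance-based sampling from a pretrained hand-object
# motion prior. All names and shapes are illustrative assumptions; the
# paper's actual model, conditioning, and observation losses differ.
import torch

def guided_sample(prior, obs_loss_fn, T=1000, guidance_scale=1.0,
                  shape=(1, 120, 99)):  # (batch, frames, hand+object DoF), assumed
    x = torch.randn(shape)  # start the reverse diffusion from pure noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        with torch.no_grad():
            # One reverse-diffusion step from the pretrained motion prior
            # (assumed interface returning the posterior mean and std).
            mean, sigma = prior.p_mean_variance(x, t_batch)
        # Guidance: nudge the step toward trajectories that agree with the
        # video observations (e.g., 2D reprojection or mask losses).
        x_in = x.detach().requires_grad_(True)
        loss = obs_loss_fn(prior.predict_x0(x_in, t_batch))
        grad = torch.autograd.grad(loss, x_in)[0]
        mean = mean - guidance_scale * sigma ** 2 * grad
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = (mean + sigma * noise).detach()
    return x  # one world-space hand-object trajectory sample
```

The same loop can be rerun to draw multiple plausible trajectories for occluded or out-of-view segments; how WHOLE actually conditions and guides its prior is specified in the paper itself.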
Related papers
- ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos [22.436134664301473]
We introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos. ForeHOI achieves state-of-the-art performance in object reconstruction, significantly outperforming previous methods with around a 100x speedup.
arXiv Detail & Related papers (2026-02-05T22:05:57Z) - EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos [25.047225764745978]
We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from egocentric monocular videos with dynamic cameras in the wild. In experiments, our method achieves state-of-the-art performance in W-HOI reconstruction.
arXiv Detail & Related papers (2026-01-03T03:08:48Z) - Zero-shot Reconstruction of In-Scene Object Manipulation from Video [47.13702503259619]
We build the first system to address the problem of reconstructing in-scene object manipulation from a monocular RGB video. It is challenging due to ill-posed scene reconstruction, ambiguous hand-object depth, and the need for physically plausible interactions.
arXiv Detail & Related papers (2025-12-22T18:58:29Z) - Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views [40.35520614736267]
We present a universal hand motion forecasting framework considering multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances. A novel dual-branch diffusion model is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision. As the first work to incorporate downstream task evaluation in the literature, we build novel benchmarks to assess the real-world applicability of hand motion forecasting algorithms.
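The dual-branch design mentioned above, two prediction streams that exchange information to capture head-hand synergy, can be sketched as follows. This is a hedged illustration: the layer sizes, cross-attention fusion, and pose dimensions are assumptions, and a real diffusion denoiser would additionally take a noise-timestep embedding, omitted here for brevity.

```python
# Illustrative dual-branch denoiser jointly predicting head and hand motion.
# Architecture details are assumptions, not Uni-Hand's published design.
import torch.nn as nn

class DualBranchDenoiser(nn.Module):
    def __init__(self, head_dim=6, hand_dim=42, hidden=256):
        super().__init__()
        self.head_enc = nn.Linear(head_dim, hidden)
        self.hand_enc = nn.Linear(hand_dim, hidden)
        # Cross-attention lets each branch attend to the other,
        # modeling head-hand motion synergy.
        self.head_from_hand = nn.MultiheadAttention(hidden, 4, batch_first=True)
        self.hand_from_head = nn.MultiheadAttention(hidden, 4, batch_first=True)
        self.head_out = nn.Linear(hidden, head_dim)
        self.hand_out = nn.Linear(hidden, hand_dim)

    def forward(self, head_noisy, hand_noisy):
        # Inputs: noisy motion sequences of shape (batch, frames, dim).
        h = self.head_enc(head_noisy)
        g = self.hand_enc(hand_noisy)
        h_fused, _ = self.head_from_hand(h, g, g)  # head queries hand features
        g_fused, _ = self.hand_from_head(g, h, h)  # hand queries head features
        return self.head_out(h + h_fused), self.hand_out(g + g_fused)
```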
arXiv Detail & Related papers (2025-11-17T02:14:13Z) - EgoLoc: A Generalizable Solution for Temporal Interaction Localization in Egocentric Videos [13.10069586920198]
Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. We propose a novel zero-shot approach dubbed EgoLoc to localize hand-object contact and separation timestamps in egocentric videos. EgoLoc exploits a vision-language model to identify contact/separation attributes, localize specific timestamps, and provide closed-loop feedback for further refinement.
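One simple way to realize the VLM-driven timestamp localization sketched in this summary is a search over frames driven by yes/no contact queries. The sketch below assumes a hypothetical `query_vlm(frame, prompt)` callable and a single no-contact-to-contact transition in the clip; EgoLoc's actual prompting and feedback loop are more elaborate.

```python
# Hedged sketch of zero-shot contact localization with a vision-language
# model; query_vlm is a hypothetical callable, not EgoLoc's interface.
def localize_contact(frames, query_vlm):
    # Binary search for the first frame where the VLM reports contact,
    # assuming one no-contact -> contact transition in the clip.
    lo, hi = 0, len(frames) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        reply = query_vlm(frames[mid],
                          "Is the hand touching the object? Answer yes or no.")
        if reply.strip().lower().startswith("yes"):
            hi = mid      # contact already established; search earlier
        else:
            lo = mid + 1  # still pre-contact; search later
    return lo  # estimated contact timestamp (frame index)
```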
arXiv Detail & Related papers (2025-08-17T12:38:56Z) - Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects [89.95728475983263]
Holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition, and motion generation.
We design the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits.
Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis of 3D hand(-object) reconstruction tasks.
arXiv Detail & Related papers (2024-03-25T05:12:21Z) - HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video [70.11702620562889]
HOLD is the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
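The compositional implicit model named above can be pictured as two per-entity fields queried in their own canonical frames and composed for rendering. The following is a minimal sketch under that reading, signed-distance MLPs composed by a union (min); the network sizes, the rigid (rather than articulated) hand transform, and the composition rule are simplifying assumptions, not HOLD's implementation.

```python
# Minimal sketch of a compositional implicit model: two per-entity SDF
# networks queried at the same world points, composed by a union (min).
# Sizes and the rigid hand transform are simplifying assumptions.
import torch
import torch.nn as nn

def sdf_mlp(hidden=128):
    return nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, 1))

class CompositionalSDF(nn.Module):
    def __init__(self):
        super().__init__()
        self.hand_sdf = sdf_mlp()    # queried in hand-canonical space
        self.object_sdf = sdf_mlp()  # queried in object-canonical space

    def forward(self, pts_world, world_to_hand, world_to_object):
        # Map world points into each entity's canonical frame (4x4 rigid
        # transforms), query each field, then take the union of the shapes.
        def transform(T, pts):
            return pts @ T[:3, :3].T + T[:3, 3]
        d_hand = self.hand_sdf(transform(world_to_hand, pts_world))
        d_obj = self.object_sdf(transform(world_to_object, pts_world))
        return torch.minimum(d_hand, d_obj)  # composed scene SDF
```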
arXiv Detail & Related papers (2023-11-30T10:50:35Z) - GRIP: Generating Interaction Poses Using Spatial Cues and Latent Consistency [57.9920824261925]
Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment.
Modeling realistic hand-object interactions is critical for applications in computer graphics, computer vision, and mixed reality.
GRIP is a learning-based method that takes as input the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction.
arXiv Detail & Related papers (2023-08-22T17:59:51Z) - Full-Body Articulated Human-Object Interaction [61.01135739641217]
CHAIRS is a large-scale motion-captured f-AHOI dataset consisting of 16.2 hours of versatile interactions.
CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process.
By learning the geometrical relationships in HOI, we devise the first model that leverages human pose estimation.
arXiv Detail & Related papers (2022-12-20T19:50:54Z) - H2O: Two Hands Manipulating Objects for First Person Interaction Recognition [70.46638409156772]
We present a comprehensive framework for egocentric interaction recognition using markerless 3D annotations of two hands manipulating objects.
Our method produces annotations of the 3D pose of two hands and the 6D pose of the manipulated objects, along with their interaction labels for each frame.
Our dataset, called H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for left & right hands, 6D object poses, ground-truth camera poses, object meshes and scene point clouds.
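The annotation types listed in this summary suggest a natural per-frame record layout. The dataclass below is purely illustrative, assumed field names and shapes for the modalities the summary enumerates, not H2O's actual file format.

```python
# Hypothetical per-frame record mirroring the annotations the H2O summary
# lists; names and shapes are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class H2OFrame:
    rgb: np.ndarray                # (V, H, W, 3) multi-view color images
    depth: np.ndarray              # (V, H, W) aligned depth maps
    left_hand_joints: np.ndarray   # (21, 3) ground-truth 3D pose, left hand
    right_hand_joints: np.ndarray  # (21, 3) ground-truth 3D pose, right hand
    object_pose: np.ndarray        # (4, 4) 6D object pose as a rigid transform
    object_class: int              # index into the object class list
    interaction_label: int         # per-frame interaction label
    camera_poses: np.ndarray       # (V, 4, 4) ground-truth camera extrinsics
```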
arXiv Detail & Related papers (2021-04-22T17:10:42Z) - Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction [118.21363599332493]
We present a method to leverage photometric consistency across time when annotations are only available for a sparse subset of frames in a video.
Our model is trained end-to-end on color images to jointly reconstruct hands and objects in 3D by inferring their poses.
We achieve state-of-the-art results on 3D hand-object reconstruction benchmarks and demonstrate that our approach improves pose estimation accuracy.
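The core idea, that the same surface point should keep its color across neighboring frames, can be written as a photometric consistency loss. The sketch below assumes matched 3D surface points reconstructed at two frames and pinhole intrinsics `K`; it is a simplified stand-in for the paper's actual formulation.

```python
# Hedged sketch of a photometric consistency loss: sample the colors that
# matched 3D surface points project to in two frames and compare them.
import torch
import torch.nn.functional as F

def photometric_loss(img_t, img_t1, pts_t, pts_t1, K):
    """img_*: (1, 3, H, W) images; pts_*: (N, 3) matched surface points
    reconstructed at frames t and t+1; K: (3, 3) camera intrinsics."""
    H, W = img_t.shape[-2:]

    def to_grid(pts):
        uv = (K @ pts.T).T                    # project to pixel coordinates
        uv = uv[:, :2] / uv[:, 2:3]
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        gx = uv[:, 0] / (W - 1) * 2 - 1
        gy = uv[:, 1] / (H - 1) * 2 - 1
        return torch.stack([gx, gy], dim=-1).view(1, 1, -1, 2)

    col_t = F.grid_sample(img_t, to_grid(pts_t), align_corners=True)
    col_t1 = F.grid_sample(img_t1, to_grid(pts_t1), align_corners=True)
    return (col_t - col_t1).abs().mean()  # same point, same color
```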
arXiv Detail & Related papers (2020-04-28T12:03:14Z)