Related papers: Hand-Aware Egocentric Motion Reconstruction with Sequence-Level Context

URL: http://arxiv.org/abs/2512.19283v1
Date: Mon, 22 Dec 2025 11:26:41 GMT
Title: Hand-Aware Egocentric Motion Reconstruction with Sequence-Level Context
Authors: Kyungwon Cho, Hanbyul Joo
Abstract summary: We present HaMoS, the first hand-aware, sequence-level diffusion framework that directly conditions on both head trajectory and intermittently visible hand cues. We also demonstrate that sequence-level contexts such as body shape and field-of-view are crucial for accurate motion reconstruction.
Score: 17.735273173582716
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Egocentric vision systems are becoming widely available, creating new opportunities for human-computer interaction. A core challenge is estimating the wearer's full-body motion from first-person videos, which is crucial for understanding human behavior. However, this task is difficult since most body parts are invisible from the egocentric view. Prior approaches mainly rely on head trajectories, leading to ambiguity, or assume continuously tracked hands, which is unrealistic for lightweight egocentric devices. In this work, we present HaMoS, the first hand-aware, sequence-level diffusion framework that directly conditions on both head trajectory and intermittently visible hand cues caused by field-of-view limitations and occlusions, as in real-world egocentric devices. To overcome the lack of datasets pairing diverse camera views with human motion, we introduce a novel augmentation method that models such real-world conditions. We also demonstrate that sequence-level contexts such as body shape and field-of-view are crucial for accurate motion reconstruction, and thus employ local attention to infer long sequences efficiently. Experiments on public benchmarks show that our method achieves state-of-the-art accuracy and temporal smoothness, demonstrating a practical step toward reliable in-the-wild egocentric 3D motion understanding.

Related papers

ECHO: Ego-Centric modeling of Human-Object interactions [71.17118015822699]
ECHO (Ego-Centric modeling of Human-Object interactions) is developed. It recovers three modalities: human pose, object motion, and contact from such minimal observation. It outperforms existing methods that do not offer the same flexibility.
arXiv Detail & Related papers (2025-08-29T12:12:22Z)
EgoM2P: Egocentric Multimodal Multitask Pretraining [55.259234688003545]
Building large-scale egocentric multimodal and multitask models presents unique challenges. EgoM2P is a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. We will fully open-source EgoM2P to support the community and advance egocentric vision research.
arXiv Detail & Related papers (2025-06-09T15:59:25Z)
Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data [16.431101717478796]
Current methods for ego-body pose estimation rely on temporally dense sensor data. We develop a two-stage approach that decomposes the problem into temporal completion and spatial completion.
arXiv Detail & Related papers (2024-11-05T23:53:19Z)
MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos [27.766405152248055]
Hand trajectory prediction plays a vital role in comprehending human motion patterns. However, capturing high-level human intentions consistent with reasonable temporal causality is challenging when only egocentric videos are available. We propose a novel hand trajectory prediction method dubbed MADiff, which forecasts future hand waypoints with diffusion models.
arXiv Detail & Related papers (2024-09-04T12:06:33Z)
Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects [89.95728475983263]
Holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation. We design the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits. Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks.
arXiv Detail & Related papers (2024-03-25T05:12:21Z)
EgoGen: An Egocentric Synthetic Data Generator [53.32942235801499]
EgoGen is a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks. At the heart of EgoGen is a novel human motion synthesis model that directly leverages egocentric visual inputs of a virtual human to sense the 3D environment. We demonstrate EgoGen's efficacy in three tasks: mapping and localization for head-mounted cameras, egocentric camera tracking, and human mesh recovery from egocentric views.
arXiv Detail & Related papers (2024-01-16T18:55:22Z)
Ego-Body Pose Estimation via Ego-Head Pose Estimation [22.08240141115053]
Estimating 3D human motion from an egocentric video sequence plays a critical role in human behavior understanding and has various applications in VR/AR. We propose a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (EgoEgo), which decomposes the problem into two stages, connected by the head motion as an intermediate representation. This disentanglement of head and body pose eliminates the need for training datasets with paired egocentric videos and 3D human motion.
arXiv Detail & Related papers (2022-12-09T02:25:20Z)
Context-Aware Sequence Alignment using 4D Skeletal Augmentation [67.05537307224525]
Temporal alignment of fine-grained human actions in videos is important for numerous applications in computer vision, robotics, and mixed reality. We propose a novel context-aware self-supervised learning architecture to align sequences of actions. Specifically, CASA employs self-attention and cross-attention mechanisms to incorporate the spatial and temporal context of human actions.
arXiv Detail & Related papers (2022-04-26T10:59:29Z)
Estimating Egocentric 3D Human Pose in the Wild with External Weak Supervision [72.36132924512299]
We present a new egocentric pose estimation method, which can be trained on a large-scale in-the-wild egocentric dataset. We propose a novel learning strategy to supervise the egocentric features with the high-quality features extracted by a pretrained external-view pose estimation model. Experiments show that our method predicts accurate 3D poses from a single in-the-wild egocentric image and outperforms the state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2022-01-20T00:45:13Z)
4D Human Body Capture from Egocentric Video via 3D Scene Grounding [38.3169520384642]
We introduce a novel task of reconstructing a time series of second-person 3D human body meshes from monocular egocentric videos. The unique viewpoint and rapid embodied camera motion of egocentric videos raise additional technical barriers for human body capture.
arXiv Detail & Related papers (2020-11-26T15:17:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.