EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR
- URL: http://arxiv.org/abs/2603.04090v1
- Date: Wed, 04 Mar 2026 14:01:16 GMT
- Title: EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR
- Authors: Zhenyu Li, Sai Kumar Dwivedi, Filip Maric, Carlos Chacon, Nadine Bertsch, Filippo Arcadu, Tomas Hodan, Michael Ramamonjisoa, Peter Wonka, Amy Zhao, Robin Kips, Cem Keskin, Anastasia Tkach, Chenhongyi Yang,
- Abstract summary: We present a transformer-based model for temporally consistent and spatially grounded body pose estimation. We also present an auto-labeling system that enables the use of large unlabeled datasets for training. On the EgoBody3M benchmark, with a 0.8 ms latency on GPU, our model outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy, and reduces temporal jitter by 22.2% and 51.7%.
- Score: 43.739084350055435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Egocentric human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present EgoPoseFormer v2, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training. Our model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, causal temporal attention, and supports both keypoints and parametric body representations under a constant compute budget. The auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. The system follows a teacher-student schema to generate pseudo-labels and guide training with uncertainty distillation, enabling the model to generalize to different environments. On the EgoBody3M benchmark, with a 0.8 ms latency on GPU, our model outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy, and reduces temporal jitter by 22.2% and 51.7%. Our auto-labeling system further improves the wrist MPJPE by 13.1%.
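The abstract reports accuracy as MPJPE (mean per-joint position error) and gains as percentages. As a rough illustration of how such numbers are usually read, the sketch below computes MPJPE as the average Euclidean distance between predicted and ground-truth 3D joints, and a relative error reduction in percent. The function names and toy joints are hypothetical, and it is an assumption that the paper's percentages are relative reductions computed this way.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: the average Euclidean distance
    between predicted and ground-truth joints (same unit as the input)."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(baseline, ours):
    """Relative error reduction in percent (assumed reading of 'improves by X%')."""
    return 100.0 * (baseline - ours) / baseline


# Hypothetical toy data, not taken from the paper.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]

error = mpjpe(pred, gt)                 # joint errors of 3 and 4 average to 3.5
gain = relative_reduction(4.0, error)   # against a made-up baseline error of 4.0
print(f"MPJPE = {error:.2f}, relative reduction = {gain:.1f}%")
```

Under this reading, a lower MPJPE is better, and the percentage measures how much of the baseline's error was removed.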
Related papers
- EgoPoseVR: Spatiotemporal Multi-Modal Reasoning for Egocentric Full-Body Pose in Virtual Reality [1.749869555855672]
EgoPoseVR is an end-to-end framework for accurate egocentric full-body pose estimation in virtual reality (VR). It integrates headset motion cues with egocentric RGB-D observations through a dual-modality fusion pipeline. A user study in real-world scenes shows that EgoPoseVR achieved significantly higher subjective ratings in accuracy, stability, embodiment, and intention for future use.
arXiv Detail & Related papers (2026-02-05T12:17:35Z) - Smooth-Distill: A Self-distillation Framework for Multitask Learning with Wearable Sensor Data [0.0]
This paper introduces Smooth-Distill, a novel self-distillation framework designed to simultaneously perform human activity recognition (HAR) and sensor placement detection. Unlike conventional distillation methods that require separate teacher and student models, the proposed framework utilizes a smoothed, historical version of the model itself as the teacher. Experimental results show that Smooth-Distill consistently outperforms alternative approaches across different evaluation scenarios.
arXiv Detail & Related papers (2025-06-27T06:51:51Z) - LSM-2: Learning from Incomplete Wearable Sensor Data [65.58595667477505]
This paper introduces the second generation of Large Sensor Model (LSM-2) with Adaptive and Inherited Masking (AIM). AIM learns robust representations directly from incomplete data without requiring explicit imputation. Our LSM-2 with AIM achieves the best performance across a diverse range of tasks, including classification, regression and generative modeling.
arXiv Detail & Related papers (2025-06-05T17:57:11Z) - MELON: Multimodal Mixture-of-Experts with Spectral-Temporal Fusion for Long-Term Mobility Estimation in Critical Care [1.5237145555729716]
We introduce MELON, a novel framework designed to predict 12-hour mobility status in the critical care setting. We trained and evaluated the MELON model on the multimodal dataset of 126 patients recruited from nine Intensive Care Units at the University of Florida Health Shands Hospital main campus in Gainesville, Florida. Results showed that MELON outperforms conventional approaches for 12-hour mobility status estimation.
arXiv Detail & Related papers (2025-03-10T19:47:46Z) - Estimating Body and Hand Motion in an Ego-sensed World [62.61989004520802]
We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters.
arXiv Detail & Related papers (2024-10-04T17:59:57Z) - Benchmarking Adaptive Intelligence and Computer Vision on Human-Robot Collaboration [0.0]
Human-Robot Collaboration (HRC) is vital in Industry 4.0, using sensors, digital twins, collaborative robots (cobots) and intention-recognition models to have efficient manufacturing processes.
We address concept drift by integrating Adaptive Intelligence and self-labeling to improve the resilience of intention-recognition in an HRC system.
arXiv Detail & Related papers (2024-09-30T01:25:48Z) - Joint Spatial-Temporal Modeling and Contrastive Learning for Self-supervised Heart Rate Measurement [28.370473108391426]
This paper briefly introduces the solutions developed by our team, HFUT-VUT, for Track 1 of self-supervised heart rate measurement.
The goal is to develop a self-supervised method for heart rate (HR) measurement using unlabeled facial videos.
Our solutions achieved a remarkable RMSE score of 8.85277 on the test dataset, securing 2nd place in Track 1 of the challenge.
arXiv Detail & Related papers (2024-06-07T13:53:02Z) - Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos [91.44553585470688]
Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond.
We propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner.
Experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively.
arXiv Detail & Related papers (2023-08-20T18:23:07Z) - Self-Supervised Representation Learning from Temporal Ordering of Automated Driving Sequences [49.91741677556553]
We propose TempO, a temporal ordering pretext task for pre-training region-level feature representations for perception tasks.
We embed each frame by an unordered set of proposal feature vectors, a representation that is natural for object detection or tracking systems.
Extensive evaluations on the BDD100K, nuImages, and MOT17 datasets show that our TempO pre-training approach outperforms single-frame self-supervised learning methods.
arXiv Detail & Related papers (2023-02-17T18:18:27Z) - Automatic Severity Classification of Dysarthric speech by using Self-supervised Model with Multi-task Learning [4.947423926765435]
We propose a novel automatic severity assessment method for dysarthric speech using the self-supervised model in conjunction with multi-task learning.
Wav2vec 2.0 XLS-R is trained for two different tasks: severity classification and auxiliary automatic speech recognition (ASR).
Our model outperforms the traditional baseline methods, with a relative percentage increase of 1.25% for F1-score.
arXiv Detail & Related papers (2022-10-27T12:48:10Z) - SODA10M: Towards Large-Scale Object Detection Benchmark for Autonomous Driving [94.11868795445798]
We release a Large-Scale Object Detection benchmark for Autonomous driving, named as SODA10M, containing 10 million unlabeled images and 20K images labeled with 6 representative object categories.
To improve diversity, the images are collected every ten seconds per frame within 32 different cities under different weather conditions, periods and location scenes.
We provide extensive experiments and deep analyses of existing supervised state-of-the-art detection models, popular self-supervised and semi-supervised approaches, and some insights about how to develop future models.
arXiv Detail & Related papers (2021-06-21T13:55:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.