EgoPoseVR: Spatiotemporal Multi-Modal Reasoning for Egocentric Full-Body Pose in Virtual Reality
- URL: http://arxiv.org/abs/2602.05590v1
- Date: Thu, 05 Feb 2026 12:17:35 GMT
- Title: EgoPoseVR: Spatiotemporal Multi-Modal Reasoning for Egocentric Full-Body Pose in Virtual Reality
- Authors: Haojie Cheng, Shaun Jing Heng Ong, Shaoyu Cai, Aiden Tat Yang Koh, Fuxi Ouyang, Eng Tat Khoo
- Abstract summary: EgoPoseVR is an end-to-end framework for accurate egocentric full-body pose estimation in virtual reality (VR). It integrates headset motion cues with egocentric RGB-D observations through a dual-modality fusion pipeline. A user study in real-world scenes shows that EgoPoseVR achieved significantly higher subjective ratings in accuracy, stability, embodiment, and intention for future use.
- Score: 1.749869555855672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Immersive virtual reality (VR) applications demand accurate, temporally coherent full-body pose tracking. Recent head-mounted camera-based approaches show promise in egocentric pose estimation, but encounter challenges when applied to VR head-mounted displays (HMDs), including temporal instability, inaccurate lower-body estimation, and the lack of real-time performance. To address these limitations, we present EgoPoseVR, an end-to-end framework for accurate egocentric full-body pose estimation in VR that integrates headset motion cues with egocentric RGB-D observations through a dual-modality fusion pipeline. A spatiotemporal encoder extracts frame- and joint-level representations, which are fused via cross-attention to fully exploit complementary motion cues across modalities. A kinematic optimization module then imposes constraints from HMD signals, enhancing the accuracy and stability of pose estimation. To facilitate training and evaluation, we introduce a large-scale synthetic dataset of over 1.8 million temporally aligned HMD and RGB-D frames across diverse VR scenarios. Experimental results show that EgoPoseVR outperforms state-of-the-art egocentric pose estimation models. A user study in real-world scenes further shows that EgoPoseVR achieved significantly higher subjective ratings in accuracy, stability, embodiment, and intention for future use compared to baseline methods. These results show that EgoPoseVR enables robust full-body pose tracking, offering a practical solution for accurate VR embodiment without requiring additional body-worn sensors or room-scale tracking systems.
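To make the described pipeline concrete, below is a minimal sketch of how the dual-modality cross-attention fusion might look, with joint-level RGB-D tokens attending to frame-level HMD motion tokens. All module names, dimensions, and input layouts here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the dual-modality cross-attention fusion described
# in the abstract. Names and shapes are illustrative assumptions, not the
# authors' released code.
import torch
import torch.nn as nn

class DualModalityFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Temporal encoder for HMD motion cues (position + orientation per frame).
        self.hmd_encoder = nn.GRU(input_size=7, hidden_size=dim, batch_first=True)
        # Project per-joint tokens assumed to come from an RGB-D visual backbone.
        self.joint_proj = nn.Linear(512, dim)
        # Cross-attention: joint tokens (queries) attend to HMD motion tokens.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Regress a 3D position per joint from the fused representation.
        self.head = nn.Linear(dim, 3)

    def forward(self, hmd_seq, joint_feats):
        # hmd_seq:     (B, T, 7)   per-frame HMD position (3) + quaternion (4)
        # joint_feats: (B, J, 512) per-joint visual features for the current frame
        motion_tokens, _ = self.hmd_encoder(hmd_seq)           # (B, T, dim)
        joint_tokens = self.joint_proj(joint_feats)            # (B, J, dim)
        fused, _ = self.cross_attn(joint_tokens,               # queries
                                   motion_tokens,              # keys
                                   motion_tokens)              # values
        return self.head(fused)                                # (B, J, 3)

# Example usage with random inputs: batch of 2, 30 HMD frames, 22 joints.
model = DualModalityFusion()
hmd = torch.randn(2, 30, 7)
feats = torch.randn(2, 22, 512)
print(model(hmd, feats).shape)  # torch.Size([2, 22, 3])
```

Per the abstract, a kinematic optimization module would then refine the regressed joints against the measured HMD signal, e.g. by solving for the body configuration whose head joint best matches the headset pose.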
Related papers
- Gaze Prediction in Virtual Reality Without Eye Tracking Using Visual and Head Motion Cues [3.4383905541567583]
We present a novel gaze prediction framework that combines Head-Mounted Display (HMD) motion signals with visual saliency cues derived from video frames. Our method employs UniSal, a lightweight saliency encoder, to extract visual features, which are then fused with HMD motion data and processed through a time-series prediction module. Experiments on the EHTask dataset, along with deployment on commercial VR hardware, show that our approach consistently outperforms baselines such as Center-of-HMD and Mean Gaze.
arXiv Detail & Related papers (2026-01-26T11:26:27Z) - GazeProphet: Software-Only Gaze Prediction for VR Foveated Rendering [0.0]
Foveated rendering significantly reduces computational demands in virtual reality applications. Current approaches require expensive hardware-based eye tracking systems. This paper presents GazeProphet, a software-only approach for predicting gaze locations in VR environments.
arXiv Detail & Related papers (2025-08-19T06:09:23Z) - SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training [82.68200031146299]
We propose a one-step diffusion-based video restoration (VR) model, termed SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures.
arXiv Detail & Related papers (2025-06-05T17:51:05Z) - SSD-Poser: Avatar Pose Estimation with State Space Duality from Sparse Observations [5.040145546652934]
A lightweight and efficient model, SSD-Poser, is designed for robust full-body motion estimation from sparse observations. SSD-Poser incorporates a well-designed hybrid encoder, State Space Attentions, to adapt the state space to complex motion poses. Experiments on the AMASS dataset demonstrate that SSD-Poser achieves exceptional accuracy and computational efficiency.
arXiv Detail & Related papers (2025-04-25T13:18:06Z) - FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video [52.33896173943054]
Egocentric motion capture with a head-mounted body-facing stereo camera is crucial for VR and AR applications. Existing methods rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings. We propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction.
arXiv Detail & Related papers (2025-03-29T14:26:06Z) - Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation [69.68568832269285]
Egocentric 3D human pose estimation has been actively studied using cameras installed in front of a head-mounted device (HMD). It remains unclear whether adding rear cameras benefits full-body tracking, given self-occlusion and limited field-of-view coverage. We propose a new transformer-based method that refines 2D joint heatmap estimation with multi-view information and heatmap uncertainty.
arXiv Detail & Related papers (2025-03-14T17:59:54Z) - Estimating Body and Hand Motion in an Ego-sensed World [62.61989004520802]
We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters.
arXiv Detail & Related papers (2024-10-04T17:59:57Z) - Self-Avatar Animation in Virtual Reality: Impact of Motion Signals Artifacts on the Full-Body Pose Reconstruction [13.422686350235615]
We aim to measure the impact of motion signal artifacts on the reconstruction of the articulated self-avatar's full-body pose.
We analyze the motion reconstruction errors using ground truth and 3D Cartesian coordinates estimated from YOLOv8 pose estimation (for a sketch of a typical joint-error metric, see the example after this list).
arXiv Detail & Related papers (2024-04-29T12:02:06Z) - 3D Human Pose Perception from Egocentric Stereo Videos [67.9563319914377]
We propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation.
Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting.
We will release UnrealEgo2, UnrealEgo-RW, and trained models on our project page.
arXiv Detail & Related papers (2023-12-30T21:21:54Z) - EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere [29.795731025552957]
EgoPoser robustly models body pose from intermittent hand position and orientation tracking, which is available only when the hands are inside the headset's field of view.
We introduce a novel global motion decomposition method that predicts full-body pose independent of global positions.
We experimentally evaluate our method and show that it outperforms state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-08-12T07:46:50Z) - SelfPose: 3D Egocentric Pose Estimation from a Headset Mounted Camera [97.0162841635425]
We present a solution to egocentric 3D body pose estimation from monocular images captured from downward-looking fisheye cameras installed on the rim of a head-mounted VR device.
This unusual viewpoint leads to images with unique visual appearance, with severe self-occlusions and perspective distortions.
We propose an encoder-decoder architecture with a novel multi-branch decoder designed to account for the varying uncertainty in 2D predictions.
arXiv Detail & Related papers (2020-11-02T16:18:06Z)
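Reconstruction error in work like the self-avatar study above is typically reported as mean per-joint position error (MPJPE) between estimated and ground-truth 3D joints. The snippet below is a minimal sketch of that metric; the array shapes and joint count are illustrative assumptions, not tied to any paper's evaluation code.

```python
# Minimal sketch of mean per-joint position error (MPJPE), the standard
# metric for comparing estimated 3D joint positions against ground truth.
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance per joint.

    pred, gt: arrays of shape (T, J, 3) -- T frames, J joints, xyz in meters.
    Returns the error averaged over all frames and joints.
    """
    assert pred.shape == gt.shape
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Example: two random pose sequences of 100 frames with 17 joints.
pred = np.random.rand(100, 17, 3)
gt = np.random.rand(100, 17, 3)
print(f"MPJPE: {mpjpe(pred, gt):.4f} m")
```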
This list is automatically generated from the titles and abstracts of the papers on this site.