Related papers: Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors

Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors

URL: http://arxiv.org/abs/2505.05336v1
Date: Thu, 08 May 2025 15:28:09 GMT
Title: Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors
Authors: Zunjie Zhu, Yan Zhao, Yihan Hu, Guoxiang Wang, Hai Qiu, Bolun Zheng, Chenggang Yan, Feng Xu,
Abstract summary: Full-body pose estimation from sparse tracking signals is not limited by environmental conditions or recording range.<n>Previous works either face the challenge of wearing additional sensors on the pelvis and lower-body or rely on external visual sensors to obtain global positions of key joints.<n>To improve the practicality of the technology for virtual reality applications, we estimate full-body poses using only inertial data obtained from three Inertial Measurement Unit (IMU) sensors worn on the head and wrists.
Score: 25.67875816218477
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The motion capture system that supports full-body virtual representation is of key significance for virtual reality. Compared to vision-based systems, full-body pose estimation from sparse tracking signals is not limited by environmental conditions or recording range. However, previous works either face the challenge of wearing additional sensors on the pelvis and lower-body or rely on external visual sensors to obtain global positions of key joints. To improve the practicality of the technology for virtual reality applications, we estimate full-body poses using only inertial data obtained from three Inertial Measurement Unit (IMU) sensors worn on the head and wrists, thereby reducing the complexity of the hardware system. In this work, we propose a method called Progressive Inertial Poser (ProgIP) for human pose estimation, which combines neural network estimation with a human dynamics model, considers the hierarchical structure of the kinematic chain, and employs a multi-stage progressive network estimation with increased depth to reconstruct full-body motion in real time. The encoder combines Transformer Encoder and bidirectional LSTM (TE-biLSTM) to flexibly capture the temporal dependencies of the inertial sequence, while the decoder based on multi-layer perceptrons (MLPs) transforms high-dimensional features and accurately projects them onto Skinned Multi-Person Linear (SMPL) model parameters. Quantitative and qualitative experimental results on multiple public datasets show that our method outperforms state-of-the-art methods with the same inputs, and is comparable to recent works using six IMU sensors.

Related papers

Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera [54.967647497048205]
We present Stereo-Inertial Poser, a real-time motion capture system that estimates metric-accurate and shape-aware 3D human motion.<n>We replace the monocular RGB with stereo vision, enabling direct 3D keypoint extraction and body shape parameter estimation.<n>Our method produces drift-free global translation under a long recording time and reduces foot-skating effects.
arXiv Detail & Related papers (2026-03-02T17:46:38Z)
Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging [28.86800972797388]
Group Inertial Poser is a novel approach for robustly estimating body poses and global translation for multiple individuals.<n>Group Inertial Poser estimates these absolute distances between pairs of sensors from ultra-wideband ranging (UWB)<n>We introduce GIP-DB, the first IMU+UWB dataset for two-person tracking, which comprises 200 minutes of motion recordings from 14 participants.
arXiv Detail & Related papers (2025-10-24T17:11:50Z)
Visual Odometry with Transformers [68.453547770334]
We introduce Visual odometry Transformer (VoT), which processes sequences of monocular frames by extracting features.<n>Unlike prior methods, VoT directly predicts camera motion without estimating dense geometry and relies solely on camera poses for supervision.<n>VoT scales effectively with larger datasets, benefits substantially from stronger pre-trained backbones, generalizes across diverse camera motions and calibration settings, and outperforms traditional methods while running more than 3 times faster.
arXiv Detail & Related papers (2025-10-02T17:00:14Z)
MASt3R-Fusion: Integrating Feed-Forward Visual Model with IMU, GNSS for High-Functionality SLAM [12.158063913401575]
We propose MASt3R-Fusion, a multi-sensor-assisted visual SLAM framework that integrates feed-forward pointmap regression with complementary sensor information.<n>A hierarchical factor graph design is developed, which allows both real-time sliding-window optimization and global optimization with aggressive loop closures.<n>We evaluate our approach on both public benchmarks and self-collected datasets, demonstrating substantial improvements in accuracy and robustness.
arXiv Detail & Related papers (2025-09-25T05:26:28Z)
Synth It Like KITTI: Synthetic Data Generation for Object Detection in Driving Scenarios [3.30184292168618]
We propose a dataset generation pipeline based on the CARLA simulator for 3D object detection on LiDAR point clouds.<n>We are able to train an object detector on the synthetic data and demonstrate strong generalization capabilities to the KITTI dataset.
arXiv Detail & Related papers (2025-02-20T22:27:42Z)
Sparse identification of nonlinear dynamics and Koopman operators with Shallow Recurrent Decoder Networks [3.1484174280822845]
We present a method to jointly solve the sensing and model identification problems with simple implementation, efficient, and robust performance.<n>SINDy-SHRED uses Gated Recurrent Units to model sparse sensor measurements along with a shallow network decoder to reconstruct the full-temporal field from the latent state space.<n>We conduct systematic experimental studies on PDE data such as turbulent flows, real-world sensor measurements for sea surface temperature, and direct video data.
arXiv Detail & Related papers (2025-01-23T02:18:13Z)
XR-MBT: Multi-modal Full Body Tracking for XR through Self-Supervision with Learned Depth Point Cloud Registration [19.874691210555472]
XR-MBT tracks legs in XR for the first time, whereas traditional synthesis approaches based on partial body tracking are blind.<n>We demonstrate how current 3-point motion synthesis models can be extended to point cloud modalities.
arXiv Detail & Related papers (2024-11-27T14:25:32Z)
ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks. We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation. Our purelytemporalal architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z)
Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors [17.3834029178939]
This paper introduces a novel human pose estimation approach using sparse inertial sensors. It leverages a diverse array of real inertial motion capture data from different skeleton formats to improve motion diversity and model generalization. The approach demonstrates superior performance over state-of-the-art models across five public datasets, notably reducing pose error by 19% on the DIP-IMU dataset.
arXiv Detail & Related papers (2023-12-02T13:17:10Z)
Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling [13.284947022380404]
We propose a two-stage framework that can obtain accurate and smooth full-body motions with three tracking signals of head and hands only. Our framework explicitly models the joint-level features in the first stage and utilizes them astemporal tokens for alternating spatial and temporal transformer blocks to capture joint-level correlations in the second stage. With extensive experiments on the AMASS motion dataset and real-captured data, we show our proposed method can achieve more accurate and smooth motion compared to existing approaches.
arXiv Detail & Related papers (2023-08-17T08:27:55Z)
UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR. UniTR processes a variety of modalities with unified modeling and shared parameters. UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving [95.42203932627102]
3D human pose estimation is an emerging technology, which can enable the autonomous vehicle to perceive and understand the subtle and complex behaviors of pedestrians. Our method efficiently makes use of these complementary signals, in a semi-supervised fashion and outperforms existing methods with a large margin. Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages.
arXiv Detail & Related papers (2022-12-15T11:15:14Z)
Progressive Multi-view Human Mesh Recovery with Self-Supervision [68.60019434498703]
Existing solutions typically suffer from poor generalization performance to new settings. We propose a novel simulation-based training pipeline for multi-view human mesh recovery.
arXiv Detail & Related papers (2022-12-10T06:28:29Z)
Back to MLP: A Simple Baseline for Human Motion Prediction [59.18776744541904]
This paper tackles the problem of human motion prediction, consisting in forecasting future body poses from historically observed sequences. We show that the performance of these approaches can be surpassed by a light-weight and purely architectural architecture with only 0.14M parameters. An exhaustive evaluation on Human3.6M, AMASS and 3DPW datasets shows that our method, which we dub siMLPe, consistently outperforms all other approaches.
arXiv Detail & Related papers (2022-07-04T16:35:58Z)
Transformer Inertial Poser: Attention-based Real-time Human Motion Reconstruction from Sparse IMUs [79.72586714047199]
We propose an attention-based deep learning method to reconstruct full-body motion from six IMU sensors in real-time. Our method achieves new state-of-the-art results both quantitatively and qualitatively, while being simple to implement and smaller in size.
arXiv Detail & Related papers (2022-03-29T16:24:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.