Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera
- URL: http://arxiv.org/abs/2603.02130v1
- Date: Mon, 02 Mar 2026 17:46:38 GMT
- Title: Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera
- Authors: Tutian Tang, Xingyu Ji, Yutong Li, MingHao Liu, Wenqiang Xu, Cewu Lu
- Abstract summary: We present Stereo-Inertial Poser, a real-time motion capture system that estimates metric-accurate and shape-aware 3D human motion. We replace the monocular RGB camera with stereo vision, enabling direct 3D keypoint extraction and body shape parameter estimation. Our method produces drift-free global translation over long recording times and reduces foot-skating effects.
- Score: 54.967647497048205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in visual-inertial motion capture systems have demonstrated the potential of combining monocular cameras with sparse inertial measurement units (IMUs) as cost-effective solutions that effectively mitigate the occlusion and drift issues inherent in single-modality systems. However, they are still limited by metric inaccuracy in global translation, stemming from monocular depth ambiguity, and by shape-agnostic local motion estimation that ignores anthropometric variations. We present Stereo-Inertial Poser, a real-time motion capture system that leverages a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By replacing the monocular RGB camera with stereo vision, our system resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues are fused to predict drift-compensated joint positions and root motion, while a novel shape-aware fusion module dynamically harmonizes anthropometric variations with global translations. Our end-to-end pipeline runs at over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations across various datasets demonstrate state-of-the-art performance. Qualitative results show that our method produces drift-free global translation over long recording times and reduces foot-skating effects.
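The metric claim rests on calibrated stereo triangulation: once the baseline between the two cameras is known, keypoint disparity maps directly to metric depth, which is exactly what a monocular camera cannot provide. A minimal sketch of that geometry (assuming rectified images; the intrinsics, baseline, and keypoints below are illustrative placeholders, not the authors' pipeline):

```python
import numpy as np

def stereo_keypoints_to_3d(uv_left, uv_right, fx, fy, cx, cy, baseline_m):
    """Triangulate matched 2D keypoints from a rectified stereo pair.

    uv_left, uv_right: (N, 2) pixel coordinates of the same keypoints in
    the left and right images (rectified, so matching rows align).
    fx, fy, cx, cy: intrinsics of the rectified cameras.
    baseline_m: distance between the two camera centers, in meters.
    Returns (N, 3) points in the left-camera frame, in meters.
    """
    disparity = uv_left[:, 0] - uv_right[:, 0]       # pixels
    disparity = np.clip(disparity, 1e-6, None)       # avoid div-by-zero
    z = fx * baseline_m / disparity                  # metric depth
    x = (uv_left[:, 0] - cx) * z / fx
    y = (uv_left[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

# Toy example: 20 px disparity, 12 cm baseline, fx = 600 px
pts = stereo_keypoints_to_3d(
    uv_left=np.array([[640.0, 360.0]]),
    uv_right=np.array([[620.0, 360.0]]),
    fx=600.0, fy=600.0, cx=640.0, cy=360.0, baseline_m=0.12,
)
print(pts)  # depth = 600 * 0.12 / 20 = 3.6 m
```

Because the triangulated keypoints are in meters, inter-keypoint distances such as shoulder width become usable cues for body shape estimation, which is what makes the system shape-aware as well as metric-accurate.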
Related papers
- Mocap Anywhere: Towards Pairwise-Distance based Motion Capture in the Wild (for the Wild) [17.794355768651698]
Wild-Poser (WiP) is a compact, real-time Transformer-based architecture that predicts 3D joint positions from noisy or corrupted pairwise-distance (PWD) measurements. WiP generalizes across subjects of varying morphologies, including non-human species, without requiring individual body measurements or shape fitting. Operating in real time, WiP achieves low joint position error and demonstrates accurate 3D motion reconstruction for both human and animal subjects in the wild.
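The summary does not spell out the architecture, but the stated input/output contract (noisy pairwise distances in, 3D joints out) can be sketched with a toy Transformer regressor; the token scheme, sizes, and layer counts here are assumptions for illustration, not WiP's configuration:

```python
import torch
import torch.nn as nn

class PWDPoseRegressor(nn.Module):
    """Toy stand-in for a WiP-style model: per-joint tokens built from
    the (noisy) pairwise-distance matrix, refined by self-attention,
    decoded to 3D joint positions."""

    def __init__(self, num_joints=24, dim=128):
        super().__init__()
        self.embed = nn.Linear(num_joints, dim)      # one token per PWD row
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 3)                # per-token 3D position

    def forward(self, pwd):                          # pwd: (B, J, J)
        tokens = self.embed(pwd)                     # (B, J, dim)
        tokens = self.encoder(tokens)                # attention across joints
        return self.head(tokens)                     # (B, J, 3)

# A noisy PWD matrix for one subject:
J = 24
xyz = torch.randn(1, J, 3)
pwd = torch.cdist(xyz, xyz) + 0.01 * torch.randn(1, J, J)
print(PWDPoseRegressor(J)(pwd).shape)  # torch.Size([1, 24, 3])
```

Feeding rows of the PWD matrix as per-joint tokens is one plausible encoding; the actual WiP tokenization may differ.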
arXiv Detail & Related papers (2026-01-27T11:58:34Z)
- Visual Odometry with Transformers [68.453547770334]
We introduce the Visual Odometry Transformer (VoT), which processes sequences of monocular frames by extracting per-frame features. Unlike prior methods, VoT directly predicts camera motion without estimating dense geometry and relies solely on camera poses for supervision. VoT scales effectively with larger datasets, benefits substantially from stronger pre-trained backbones, generalizes across diverse camera motions and calibration settings, and outperforms traditional methods while running more than 3 times faster.
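A hedged sketch of what "predicts camera motion without estimating dense geometry" can look like: per-frame features feed a temporal Transformer whose only output is relative 6-DoF motion, so ground-truth poses suffice as supervision. The stub backbone and all dimensions are invented for illustration; VoT itself builds on strong pre-trained backbones:

```python
import torch
import torch.nn as nn

class MiniVO(nn.Module):
    """Toy VoT-style regressor. No depth or dense geometry is produced
    anywhere; the model maps frame features straight to relative poses."""

    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(               # stand-in extractor
            nn.Conv2d(3, 32, 7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.pose_head = nn.Linear(dim, 6)           # translation + axis-angle

    def forward(self, frames):                       # (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        feats = self.temporal(feats)                 # attend across time
        return self.pose_head(feats[:, 1:])          # motion of frame i vs i-1

frames = torch.randn(1, 5, 3, 128, 160)
print(MiniVO()(frames).shape)                        # torch.Size([1, 4, 6])
```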
arXiv Detail & Related papers (2025-10-02T17:00:14Z)
- Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors [25.67875816218477]
Full-body pose estimation from sparse tracking signals is not limited by environmental conditions or recording range. Previous works either require additional sensors worn on the pelvis and lower body, or rely on external visual sensors to obtain the global positions of key joints. To improve the practicality of the technology for virtual reality applications, we estimate full-body poses using only inertial data obtained from three Inertial Measurement Unit (IMU) sensors worn on the head and wrists.
arXiv Detail & Related papers (2025-05-08T15:28:09Z)
- UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler [62.06785782635153]
We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from single images alone, across domains. UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information. Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations.
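The pseudo-spherical representation can be pictured as predicting ray angles and log-depth instead of a camera-dependent (u, v, Z) grid, so the camera model and the depth are decoupled. A rough decoding sketch (the exact parameterization is an assumption, not the paper's code):

```python
import numpy as np

def pseudo_spherical_to_points(azimuth, elevation, log_depth):
    """Decode an (azimuth, elevation, log-depth) prediction into 3D points.

    Angles describe the viewing ray (the camera part); log-depth carries
    the metric distance (the scene part), so the two are disentangled.
    """
    r = np.exp(log_depth)                        # radial metric distance
    x = r * np.cos(elevation) * np.sin(azimuth)
    y = r * np.sin(elevation)
    z = r * np.cos(elevation) * np.cos(azimuth)
    return np.stack([x, y, z], axis=-1)

# A pixel looking 10 degrees right of the optical axis, 2 m away:
pt = pseudo_spherical_to_points(np.radians(10.0), 0.0, np.log(2.0))
print(pt)  # approx [0.347, 0.0, 1.970]
```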
arXiv Detail & Related papers (2025-02-27T14:03:15Z)
- ESVO2: Direct Visual-Inertial Odometry with Stereo Event Cameras [41.992980062962495]
Event-based visual odometry aims at solving the tracking and mapping subproblems, typically in parallel. We build an event-based stereo visual-inertial odometry system on top of a direct pipeline. The resulting system scales well with modern high-resolution event cameras.
arXiv Detail & Related papers (2024-10-12T05:35:27Z)
- LiveHPS: LiDAR-based Scene-level Human Pose and Shape Estimation in Free Environment [59.320414108383055]
We present LiveHPS, a novel single-LiDAR-based approach for scene-level human pose and shape estimation.
We also propose a large-scale human motion dataset, named FreeMotion, collected in various scenarios with diverse human poses.
arXiv Detail & Related papers (2024-02-27T03:08:44Z)
- Towards Scale-Aware, Robust, and Generalizable Unsupervised Monocular Depth Estimation by Integrating IMU Motion Dynamics [74.1720528573331]
Unsupervised monocular depth and ego-motion estimation has drawn extensive research attention in recent years.
We propose DynaDepth, a novel scale-aware framework that integrates information from vision and IMU motion dynamics.
We validate the effectiveness of DynaDepth by conducting extensive experiments and simulations on the KITTI and Make3D datasets.
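The core scale cue such a system can exploit is that accelerometer integration yields displacements in meters, which pins down the scale that monocular vision leaves free. A bare-bones sketch of that cue (DynaDepth itself integrates IMU dynamics into training with proper preintegration and an EKF for bias and noise, none of which is modeled here):

```python
import numpy as np

def imu_translation_magnitude(accel_world, dt, v0):
    """Double-integrate gravity-compensated, world-frame accelerometer
    samples over a short window to get a metric displacement estimate."""
    v = v0 + np.cumsum(accel_world * dt, axis=0)
    p = np.cumsum(v * dt, axis=0)
    return np.linalg.norm(p[-1])

def rescale_visual_translation(t_visual_unit_scale, accel_world, dt, v0):
    """Recover metric scale by matching the IMU displacement magnitude."""
    scale = imu_translation_magnitude(accel_world, dt, v0) / \
            (np.linalg.norm(t_visual_unit_scale) + 1e-12)
    return scale * t_visual_unit_scale

# 0.5 s of constant 0.2 m/s^2 forward acceleration from rest:
acc = np.tile([0.0, 0.0, 0.2], (50, 1))
t_metric = rescale_visual_translation(np.array([0.0, 0.0, 1.0]),
                                      acc, dt=0.01, v0=np.zeros(3))
print(t_metric)  # approx [0, 0, 0.0255] m; continuous 0.5*a*t^2 = 0.025 m
```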
arXiv Detail & Related papers (2022-07-11T07:50:22Z)
- Transformer Inertial Poser: Attention-based Real-time Human Motion Reconstruction from Sparse IMUs [79.72586714047199]
We propose an attention-based deep learning method to reconstruct full-body motion from six IMU sensors in real-time.
Our method achieves new state-of-the-art results both quantitatively and qualitatively, while being simple to implement and smaller in size.
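A toy sketch of the input/output shape of such a six-IMU attention model; the 72-dim per-frame encoding (six orientations as 3x3 matrices plus accelerations) and all layer sizes are guesses for illustration, not the paper's configuration:

```python
import torch
import torch.nn as nn

class IMUPoseTransformer(nn.Module):
    """Attention across time over six-IMU frames; outputs 24 joints as
    6D rotation representations (24 * 6 = 144) per frame."""

    def __init__(self, dim=256, num_joints=24):
        super().__init__()
        self.proj = nn.Linear(72, dim)   # 6 IMUs * (9 rot + 3 acc) = 72
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_joints * 6)

    def forward(self, imu_seq):                   # (B, T, 72)
        h = self.encoder(self.proj(imu_seq))      # attend across time
        return self.head(h)                       # (B, T, 144)

seq = torch.randn(1, 90, 72)                      # ~1.5 s at 60 Hz
print(IMUPoseTransformer()(seq).shape)            # torch.Size([1, 90, 144])
```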
arXiv Detail & Related papers (2022-03-29T16:24:52Z)
- TransPose: Real-time 3D Human Translation and Pose Estimation with Six Inertial Sensors [7.565581566766422]
We present TransPose, a DNN-based approach to perform full motion capture from only 6 Inertial Measurement Units (IMUs) at over 90 fps.
For body pose estimation, we propose a multi-stage network that estimates leaf-to-full joint positions as intermediate results.
For global translation estimation, we propose a supporting-foot-based method and an RNN-based method to robustly solve for the global translations.
arXiv Detail & Related papers (2021-05-10T18:41:42Z)
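The supporting-foot idea is simple to state: while a foot is in contact, its world position is fixed, so the root translation can be read off from how that foot moves in the root frame. A minimal sketch under that assumption (TransPose additionally blends this with an RNN-based estimate, which is omitted here):

```python
import numpy as np

def supporting_foot_translation(foot_local, support_idx):
    """Recover per-frame global root translation by assuming the
    supporting foot stays fixed on the ground while in contact.

    foot_local: (T, 2, 3) positions of [left, right] foot relative to
                the root (e.g., from forward kinematics of the pose).
    support_idx: length-T sequence of 0/1 supporting-foot indices.
    Returns (T, 3) root translations, starting from the origin.
    """
    t = np.zeros((len(foot_local), 3))
    for i in range(1, len(foot_local)):
        s = support_idx[i]
        # World position of the supporting foot must match last frame's:
        # t[i] + foot_local[i, s] == t[i-1] + foot_local[i-1, s]
        t[i] = t[i - 1] + foot_local[i - 1, s] - foot_local[i, s]
    return t

# Toy walk: the planted left foot drifts backward in the root frame,
# which reads as forward root motion in the world.
foot = np.zeros((3, 2, 3))
foot[:, 0, 2] = [0.3, 0.2, 0.1]
print(supporting_foot_translation(foot, [0, 0, 0]))
# [[0. 0. 0. ], [0. 0. 0.1], [0. 0. 0.2]]
```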