Paving the Way Towards Kinematic Assessment Using Monocular Video: A Preclinical Benchmark of State-of-the-Art Deep-Learning-Based 3D Human Pose Estimators Against Inertial Sensors in Daily Living Activities
- URL: http://arxiv.org/abs/2510.02264v1
- Date: Thu, 02 Oct 2025 17:44:31 GMT
- Title: Paving the Way Towards Kinematic Assessment Using Monocular Video: A Preclinical Benchmark of State-of-the-Art Deep-Learning-Based 3D Human Pose Estimators Against Inertial Sensors in Daily Living Activities
- Authors: Mario Medrano-Paredes, Carmen Fernández-González, Francisco-Javier Díaz-Pernas, Hichem Saoudi, Javier González-Alonso, Mario Martínez-Zarzuela
- Abstract summary: This study compares monocular video-based 3D human pose estimation models with inertial measurement units (IMUs). Joint angles derived from state-of-the-art deep learning frameworks were evaluated against joint angles computed from IMU data. MotionAGFormer demonstrated superior performance, achieving the lowest overall RMSE.
- Score: 1.3854111346209868
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advances in machine learning and wearable sensors offer new opportunities for capturing and analyzing human movement outside specialized laboratories. Accurate assessment of human movement under real-world conditions is essential for telemedicine, sports science, and rehabilitation. This preclinical benchmark compares monocular video-based 3D human pose estimation models with inertial measurement units (IMUs), leveraging the VIDIMU dataset, which contains a total of 13 clinically relevant daily activities captured using both commodity video cameras and five IMUs. During this initial study only healthy subjects were recorded, so results cannot be generalized to pathological cohorts. Joint angles derived from state-of-the-art deep learning frameworks (MotionAGFormer, MotionBERT, MMPose 2D-to-3D pose lifting, and NVIDIA BodyTrack) were evaluated against joint angles computed from IMU data using OpenSim inverse kinematics, following the Human3.6M dataset format with 17 keypoints. Among them, MotionAGFormer demonstrated superior performance, achieving the lowest overall RMSE ($9.27^{\circ} \pm 4.80^{\circ}$) and MAE ($7.86^{\circ} \pm 4.18^{\circ}$), as well as the highest Pearson correlation ($0.86 \pm 0.15$) and the highest coefficient of determination $R^{2}$ ($0.67 \pm 0.28$). The results reveal that both technologies are viable for out-of-the-lab kinematic assessment. However, they also highlight key trade-offs between video- and sensor-based approaches, including cost, accessibility, and precision. This study clarifies where off-the-shelf video models already provide clinically promising kinematics in healthy adults and where they lag behind IMU-based estimates, while establishing valuable guidelines for researchers and clinicians seeking to develop robust, cost-effective, and user-friendly solutions for telehealth and remote patient monitoring.
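As an illustration (not the authors' released code), the sketch below shows one way the agreement metrics reported above (RMSE, MAE, Pearson correlation, and $R^{2}$) could be computed for a pair of time-aligned joint angle series, one from a video-based estimator and one from IMU data processed with OpenSim inverse kinematics. The function name and the synthetic placeholder signals are assumptions for illustration only.

```python
# Minimal sketch (assumed workflow, not the paper's code): compare a video-derived
# joint angle trajectory against an IMU/OpenSim reference using the metrics
# reported in the abstract: RMSE, MAE, Pearson r, and R^2.
import numpy as np
from scipy.stats import pearsonr

def agreement_metrics(angle_video: np.ndarray, angle_imu: np.ndarray) -> dict:
    """Both inputs are time-aligned joint angle series in degrees."""
    err = angle_video - angle_imu
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    r, _ = pearsonr(angle_video, angle_imu)
    # Coefficient of determination of the video estimate against the IMU reference.
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((angle_imu - angle_imu.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE_deg": rmse, "MAE_deg": mae, "pearson_r": float(r), "R2": r2}

# Placeholder example: a synthetic knee-flexion-like curve as the IMU reference
# and a noisy copy standing in for the video-based estimate.
t = np.linspace(0, 2 * np.pi, 200)
angle_imu = 30 + 25 * np.sin(t)
angle_video = angle_imu + np.random.normal(0, 5, t.size)
print(agreement_metrics(angle_video, angle_imu))
```

The ± values quoted in the abstract presumably reflect variability across joints, activities, and subjects, so per-trajectory metrics such as these would be aggregated accordingly in the study.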
Related papers
- SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking [19.28827026574636]
We present a biomechanics-aware keypoint simulation framework that augments human pose datasets with anatomically consistent 3D spinal keypoints. We create the first open dataset, named SIMSPINE, which provides sparse vertebra-level 3D spinal annotations for natural full-body motions. With 2.14 million frames, this enables data-driven learning of vertebral kinematics from subtle posture variations.
arXiv Detail & Related papers (2026-02-24T11:31:20Z) - Monocular Markerless Motion Capture Enables Quantitative Assessment of Upper Extremity Reachable Workspace [1.7520168411745887]
This work validates a clinically accessible approach for quantifying the Upper Extremity Reachable Workspace (UERW). It uses a single (monocular) camera and Artificial Intelligence (AI)-driven Markerless Motion Capture (MMC) for biomechanical analysis. Findings support the feasibility of a frontal monocular camera configuration for UERW assessment.
arXiv Detail & Related papers (2026-02-13T18:36:27Z) - UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos [81.9180187964947]
We present UniSurg, a foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. To enable large-scale pretraining, we curate the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. These results establish UniSurg as a new standard for universal, motion-oriented surgical video understanding.
arXiv Detail & Related papers (2026-02-05T13:18:33Z) - NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification [56.133469598652624]
Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. The Neurosurgical Anatomy Benchmark (NeuroABench) is the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures.
arXiv Detail & Related papers (2025-12-07T17:00:25Z) - GAITEX: Human motion dataset of impaired gait and rehabilitation exercises using inertial and optical sensors [0.769672852567215]
We present a multimodal dataset of physiotherapeutic and gait-related exercises, including correct and clinically relevant variants. It contains data from nine IMUs and 68 markers tracking full-body kinematics. The dataset is fully annotated with movement quality ratings and timestamped segmentations.
arXiv Detail & Related papers (2025-06-06T08:08:18Z) - Predicting Length of Stay in Neurological ICU Patients Using Classical Machine Learning and Neural Network Models: A Benchmark Study on MIMIC-IV [49.1574468325115]
This study explores multiple ML approaches for predicting LOS in the ICU, specifically for patients with neurological diseases, based on the MIMIC-IV dataset. The evaluated models include classic ML algorithms (K-Nearest Neighbors, Random Forest, XGBoost, and CatBoost) and neural networks (LSTM, BERT, and Temporal Fusion Transformer).
arXiv Detail & Related papers (2025-05-23T14:06:42Z) - Validation of Human Pose Estimation and Human Mesh Recovery for Extracting Clinically Relevant Motion Data from Videos [79.62407455005561]
Marker-less motion capture using human pose estimation produces results in line with both the IMU and MoCap kinematics. While there is still room for improvement in the quality of the data produced, we believe that this compromise is within an acceptable margin of error.
arXiv Detail & Related papers (2025-03-18T22:18:33Z) - Finetuning and Quantization of EEG-Based Foundational BioSignal Models on ECG and PPG Data for Blood Pressure Estimation [53.2981100111204]
Photoplethysmography and electrocardiography can potentially enable continuous blood pressure (BP) monitoring. Yet building accurate and robust machine learning (ML) models remains challenging due to variability in data quality and patient-specific factors. In this work, we investigate whether a model pre-trained on one modality can effectively be exploited to improve the accuracy of a different signal type. Our approach achieves near state-of-the-art accuracy for diastolic BP and surpasses the accuracy of prior works for systolic BP by 1.5x.
arXiv Detail & Related papers (2025-02-10T13:33:12Z) - Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
LLaVA-Rad inference is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z) - 3D Kinematics Estimation from Video with a Biomechanical Model and Synthetic Training Data [4.130944152992895]
We propose a novel biomechanics-aware network that directly outputs 3D kinematics from two input views.
Our experiments demonstrate that the proposed approach, trained only on synthetic data, outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2024-02-20T17:33:40Z) - Next-generation Surgical Navigation: Marker-less Multi-view 6DoF Pose Estimation of Surgical Instruments [64.59698930334012]
First, we present a multi-camera capture setup consisting of static and head-mounted cameras. Second, we publish a multi-view RGB-D video dataset of ex-vivo spine surgeries, captured in a surgical wet lab and a real operating theatre. Third, we evaluate three state-of-the-art single-view and multi-view methods for the task of 6DoF pose estimation of surgical instruments.
arXiv Detail & Related papers (2023-05-05T13:42:19Z) - Multimodal video and IMU kinematic dataset on daily life activities using affordable devices (VIDIMU) [0.0]
The objective of the dataset is to pave the way towards affordable patient gross motor tracking solutions for daily life activities recognition and kinematic analysis.
The novelty of the dataset lies in: (i) the clinical relevance of the chosen movements, (ii) the combined utilization of affordable video and custom sensors, and (iii) the implementation of state-of-the-art tools for multimodal data processing of 3D body pose tracking and motion reconstruction.
arXiv Detail & Related papers (2023-03-27T14:05:49Z) - Appearance Learning for Image-based Motion Estimation in Tomography [60.980769164955454]
In tomographic imaging, anatomical structures are reconstructed by applying a pseudo-inverse forward model to acquired signals.
Patient motion corrupts the geometry alignment in the reconstruction process resulting in motion artifacts.
We propose an appearance learning approach recognizing the structures of rigid motion independently from the scanned object.
arXiv Detail & Related papers (2020-06-18T09:49:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.