Related papers: 3DPCNet: Pose Canonicalization for Robust Viewpoint-Invariant 3D Kinematic Analysis from Monocular RGB cameras

3DPCNet: Pose Canonicalization for Robust Viewpoint-Invariant 3D Kinematic Analysis from Monocular RGB cameras

URL: http://arxiv.org/abs/2509.23455v1
Date: Sat, 27 Sep 2025 18:55:21 GMT
Title: 3DPCNet: Pose Canonicalization for Robust Viewpoint-Invariant 3D Kinematic Analysis from Monocular RGB cameras
Authors: Tharindu Ekanayake, Constantino Álvarez Casado, Miguel Bordallo López,
Abstract summary: 3DPCNet is a compact, estimator-agnostic module that operates directly on 3D joint coordinates.<n>Our method produces acceleration signals from video that show strong visual correspondence to ground-truth IMU sensor data.
Score: 7.906702226082628
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Monocular 3D pose estimators produce camera-centered skeletons, creating view-dependent kinematic signals that complicate comparative analysis in applications such as health and sports science. We present 3DPCNet, a compact, estimator-agnostic module that operates directly on 3D joint coordinates to rectify any input pose into a consistent, body-centered canonical frame. Its hybrid encoder fuses local skeletal features from a graph convolutional network with global context from a transformer via a gated cross-attention mechanism. From this representation, the model predicts a continuous 6D rotation that is mapped to an $SO(3)$ matrix to align the pose. We train the model in a self-supervised manner on the MM-Fi dataset using synthetically rotated poses, guided by a composite loss ensuring both accurate rotation and pose reconstruction. On the MM-Fi benchmark, 3DPCNet reduces the mean rotation error from over 20$^{\circ}$ to 3.4$^{\circ}$ and the Mean Per Joint Position Error from ~64 mm to 47 mm compared to a geometric baseline. Qualitative evaluations on the TotalCapture dataset further demonstrate that our method produces acceleration signals from video that show strong visual correspondence to ground-truth IMU sensor data, confirming that our module removes viewpoint variability to enable physically plausible motion analysis.

Related papers

From Camera to World: A Plug-and-Play Module for Human Mesh Transformation [1.5453237467077674]
We propose Mesh-Plug, a plug-and-play module that transforms human meshes from camera coordinates to world coordinates.<n>Key innovation lies in a human-centered approach that leverages both RGB images and depth maps rendered from the initial mesh to estimate camera rotation parameters.<n>Our framework outperforms state-of-the-art methods on the benchmark datasets SPEC-SYN and SPEC-MTP.
arXiv Detail & Related papers (2025-12-17T09:05:46Z)
Controllable Human-centric Keyframe Interpolation with Generative Prior [55.16558476905587]
We introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process.<n>We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video.
arXiv Detail & Related papers (2025-06-03T17:50:05Z)
PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting [54.7468067660037]
PF3plat sets a new state-of-the-art across all benchmarks, supported by comprehensive ablation studies validating our design choices.<n>Our framework capitalizes on fast speed, scalability, and high-quality 3D reconstruction and view synthesis capabilities of 3DGS.
arXiv Detail & Related papers (2024-10-29T15:28:15Z)
Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens [34.50928515515274]
We present a novel method to estimate 3D human pose and shape from monocular videos. The proposed method attains superior performances on the 3DPW and Human3.6M datasets.
arXiv Detail & Related papers (2023-03-01T07:48:01Z)
A Dual-Masked Auto-Encoder for Robust Motion Capture with Spatial-Temporal Skeletal Token Completion [13.88656793940129]
We propose an adaptive, identity-aware triangulation module to reconstruct 3D joints and identify the missing joints for each identity. We then propose a Dual-Masked Auto-Encoder (D-MAE) which encodes the joint status with both skeletal-structural and temporal position encoding for trajectory completion. In order to demonstrate the proposed model's capability in dealing with severe data loss scenarios, we contribute a high-accuracy and challenging motion capture dataset.
arXiv Detail & Related papers (2022-07-15T10:00:43Z)
PolarFormer: Multi-camera 3D Object Detection with Polar Transformers [93.49713023975727]
3D object detection in autonomous driving aims to reason "what" and "where" the objects of interest present in a 3D world. Existing methods often adopt the canonical Cartesian coordinate system with perpendicular axis. We propose a new Polar Transformer (PolarFormer) for more accurate 3D object detection in the bird's-eye-view (BEV) taking as input only multi-camera 2D images.
arXiv Detail & Related papers (2022-06-30T16:32:48Z)
Geometry-Contrastive Transformer for Generalized 3D Pose Transfer [95.56457218144983]
The intuition of this work is to perceive the geometric inconsistency between the given meshes with the powerful self-attention mechanism. We propose a novel geometry-contrastive Transformer that has an efficient 3D structured perceiving ability to the global geometric inconsistencies. We present a latent isometric regularization module together with a novel semi-synthesized dataset for the cross-dataset 3D pose transfer task.
arXiv Detail & Related papers (2021-12-14T13:14:24Z)
Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild [25.647676661390282]
This paper addresses the problem of 3D human body shape and pose estimation from an RGB image. We train a deep neural network to estimate a hierarchical matrix-Fisher distribution over relative 3D joint rotation matrices. We show that our method is competitive with the state-of-the-art in terms of 3D shape and pose metrics on the SSP-3D and 3DPW datasets.
arXiv Detail & Related papers (2021-10-03T11:59:37Z)
HybrIK: A Hybrid Analytical-Neural Inverse Kinematics Solution for 3D Human Pose and Shape Estimation [39.67289969828706]
We propose a novel hybrid inverse kinematics solution (HybrIK) to bridge the gap between body mesh estimation and 3D keypoint estimation. HybrIK directly transforms accurate 3D joints to relative body-part rotations for 3D body mesh reconstruction. We show that HybrIK preserves both the accuracy of 3D pose and the realistic body structure of the parametric human model.
arXiv Detail & Related papers (2020-11-30T10:32:30Z)
Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image. Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them, however, the probability of effective samples is relatively small in the 3D space. We propose to start with an initial prediction and refine it gradually towards the ground truth, with only one 3d parameter changed in each step. This requires designing a policy which gets a reward after several steps, and thus we adopt reinforcement learning to optimize it.
arXiv Detail & Related papers (2020-08-31T17:10:48Z)
3D Pose Detection in Videos: Focusing on Occlusion [0.4588028371034406]
We build upon existing methods for occlusion-aware 3D pose detection in videos. We implement a two stage architecture that consists of the stacked hourglass network to produce 2D pose predictions. To facilitate prediction on poses with occluded joints, we introduce an intuitive generalization of the cylinder man model.
arXiv Detail & Related papers (2020-06-24T07:01:17Z)
MotioNet: 3D Human Motion Reconstruction from Monocular Video with Skeleton Consistency [72.82534577726334]
We introduce MotioNet, a deep neural network that directly reconstructs the motion of a 3D human skeleton from monocular video. Our method is the first data-driven approach that directly outputs a kinematic skeleton, which is a complete, commonly used, motion representation.
arXiv Detail & Related papers (2020-06-22T08:50:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.