From Camera to World: A Plug-and-Play Module for Human Mesh Transformation
- URL: http://arxiv.org/abs/2512.15212v1
- Date: Wed, 17 Dec 2025 09:05:46 GMT
- Title: From Camera to World: A Plug-and-Play Module for Human Mesh Transformation
- Authors: Changhai Ma, Ziyu Wu, Yunkang Zhang, Qijun Ying, Boyan Liu, Xiaohui Cai
- Abstract summary: We propose Mesh-Plug, a plug-and-play module that transforms human meshes from camera coordinates to world coordinates. The key innovation lies in a human-centered approach that leverages both RGB images and depth maps rendered from the initial mesh to estimate camera rotation parameters. Our framework outperforms state-of-the-art methods on the benchmark datasets SPEC-SYN and SPEC-MTP.
- Score: 1.5453237467077674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reconstructing accurate 3D human meshes in the world coordinate system from in-the-wild images remains challenging due to the lack of camera rotation information. While existing methods achieve promising results in the camera coordinate system by assuming zero camera rotation, this simplification leads to significant errors when transforming the reconstructed mesh to the world coordinate system. To address this challenge, we propose Mesh-Plug, a plug-and-play module that accurately transforms human meshes from camera coordinates to world coordinates. Our key innovation lies in a human-centered approach that leverages both RGB images and depth maps rendered from the initial mesh to estimate camera rotation parameters, eliminating the dependency on environmental cues. Specifically, we first train a camera rotation prediction module that focuses on the human body's spatial configuration to estimate camera pitch angle. Then, by integrating the predicted camera parameters with the initial mesh, we design a mesh adjustment module that simultaneously refines the root joint orientation and body pose. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on the benchmark datasets SPEC-SYN and SPEC-MTP.
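As intuition for what such a module computes, the sketch below applies a predicted camera pitch to rotate a mesh from camera coordinates into a gravity-aligned world frame. It assumes, per the abstract, that only the pitch angle is estimated; the function names are illustrative and this is not the authors' code.

```python
import numpy as np

def pitch_rotation(pitch_rad: float) -> np.ndarray:
    """Camera rotation about its x-axis by the predicted pitch angle."""
    c, s = np.cos(pitch_rad), np.sin(pitch_rad)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0,   c,  -s],
                     [0.0,   s,   c]])

def camera_to_world(verts_cam: np.ndarray, pitch_rad: float) -> np.ndarray:
    """Rotate mesh vertices (N, 3) from camera into world coordinates.

    A mesh reconstructed under the zero-rotation assumption is tilted by
    the unknown camera pitch; applying the inverse rotation restores a
    gravity-aligned up-axis. Translation is omitted for brevity.
    """
    R_wc = pitch_rotation(pitch_rad).T   # inverse of the camera rotation
    return verts_cam @ R_wc.T

# Example: correct an SMPL-style mesh (6890 vertices) for a ~10 degree pitch.
verts_cam = np.random.randn(6890, 3)
verts_world = camera_to_world(verts_cam, np.deg2rad(10.0))
```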
Related papers
- Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera [54.967647497048205]
We present Stereo-Inertial Poser, a real-time motion capture system that estimates metric-accurate and shape-aware 3D human motion. We replace the monocular RGB input with stereo vision, enabling direct 3D keypoint extraction and body shape parameter estimation. Our method produces drift-free global translation over long recordings and reduces foot-skating effects.
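The summary attributes the direct 3D keypoint extraction to stereo vision; the closed-form triangulation that rectified stereo enables is sketched below as a generic illustration (the focal length and baseline values are made up, and this is not code from the paper).

```python
import numpy as np

def triangulate_depth(disparity_px: np.ndarray,
                      focal_px: float,
                      baseline_m: float) -> np.ndarray:
    """Standard rectified-stereo depth: Z = f * B / d.

    disparity_px: per-keypoint disparity between left/right views (pixels).
    Returns metric depth in meters; NaN where disparity is invalid (<= 0).
    """
    d = np.asarray(disparity_px, dtype=np.float64)
    with np.errstate(divide="ignore"):
        z = focal_px * baseline_m / d
    return np.where(d > 0, z, np.nan)

# Example: 40 px disparity, 800 px focal length, 12 cm baseline -> ~2.4 m.
print(triangulate_depth(np.array([40.0]), focal_px=800.0, baseline_m=0.12))
```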
arXiv Detail & Related papers (2026-03-02T17:46:38Z)
- Unified Camera Positional Encoding for Controlled Video Generation [48.5789182990001]
Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI. We introduce Relative Ray, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To facilitate systematic training and evaluation, we construct a large video dataset covering a wide range of camera motions and lens types.
arXiv Detail & Related papers (2025-12-08T07:34:01Z)
- WorldMirror: Universal 3D World Reconstruction with Any-Prior Prompting [51.69408870574092]
We present WorldMirror, an all-in-one, feed-forward model for versatile 3D geometric prediction tasks. Our framework flexibly integrates diverse geometric priors, including camera poses, intrinsics, and depth maps. WorldMirror achieves state-of-the-art performance across diverse benchmarks, from camera, point map, depth, and surface normal estimation to novel view synthesis.
arXiv Detail & Related papers (2025-10-12T17:59:09Z)
- 3DPCNet: Pose Canonicalization for Robust Viewpoint-Invariant 3D Kinematic Analysis from Monocular RGB cameras [7.906702226082628]
3DPCNet is a compact, estimator-agnostic module that operates directly on 3D joint coordinates. Our method produces acceleration signals from video that show strong visual correspondence to ground-truth IMU sensor data.
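The summary does not state how acceleration signals are derived from the estimated joints; one standard route, shown here as a sketch under that assumption, is a second-order finite difference over the per-frame 3D joint positions (the function below is illustrative, not from 3DPCNet).

```python
import numpy as np

def joint_acceleration(joints: np.ndarray, fps: float) -> np.ndarray:
    """Second-order central difference over a joint trajectory.

    joints: (T, J, 3) array of per-frame 3D joint positions (meters).
    Returns (T-2, J, 3) accelerations in m/s^2.
    """
    dt = 1.0 / fps
    return (joints[2:] - 2.0 * joints[1:-1] + joints[:-2]) / dt**2

# Example: a synthetic free-fall trajectory recovers gravity exactly,
# since the central difference is exact for quadratics.
t = np.arange(0, 1, 1 / 30.0)[:, None, None]            # 30 fps, 1 second
traj = np.concatenate([np.zeros_like(t), np.zeros_like(t),
                       -0.5 * 9.81 * t**2], axis=-1)    # (T, 1, 3)
print(joint_acceleration(traj, fps=30.0)[0, 0, 2])      # -9.81
```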
arXiv Detail & Related papers (2025-09-27T18:55:21Z)
- MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion [13.24058110580706]
We propose a novel global motion averaging framework for multi-camera systems. Our system matches or exceeds incremental SfM accuracy while significantly improving efficiency.
arXiv Detail & Related papers (2025-07-04T05:25:00Z)
- Camera Movement Estimation and Path Correction using the Combination of Modified A-SIFT and Stereo System for 3D Modelling [1.6574413179773757]
Efficient camera path generation can help resolve issues in creating accurate and efficient 3D models. A modified version of the Affine Scale-Invariant Feature Transform (ASIFT) is proposed to extract more matching points with reduced computational overhead. A novel two-camera-based rotation correction model is introduced to mitigate small rotational errors. A stereo camera-based translation estimation and correction model is implemented to determine camera movement in 3D space.
arXiv Detail & Related papers (2025-03-22T06:37:54Z)
- UniK3D: Universal Camera Monocular 3D Estimation [62.06785782635153]
We present UniK3D, the first generalizable method for monocular 3D estimation able to model any camera. Our method introduces a spherical 3D representation which allows for better disentanglement of camera and scene geometry. A comprehensive zero-shot evaluation on 13 diverse datasets demonstrates the state-of-the-art performance of UniK3D across 3D, depth, and camera metrics.
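As rough intuition for why a spherical parameterization can disentangle camera from scene, consider the standard coordinate conversion below (textbook geometry, not code from UniK3D): the angular part identifies a point's viewing ray, which is fixed by the camera model alone, while the radius carries the scene depth along that ray.

```python
import numpy as np

def cartesian_to_spherical(points: np.ndarray) -> np.ndarray:
    """Convert (N, 3) camera-frame points to (radius, azimuth, polar).

    The two angles identify the pixel's viewing ray, which depends only
    on the camera; the radius carries the scene geometry along that ray.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    azimuth = np.arctan2(y, x)
    polar = np.arccos(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))
    return np.stack([r, azimuth, polar], axis=1)

# A point 2 m along the optical axis: radius 2, both angles 0.
print(cartesian_to_spherical(np.array([[0.0, 0.0, 2.0]])))
```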
arXiv Detail & Related papers (2025-03-20T17:49:23Z)
- FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction [69.63414788486578]
FreeSplatter is a scalable feed-forward framework that generates high-quality 3D Gaussians from uncalibrated sparse-view images. Our approach employs a streamlined transformer architecture where self-attention blocks facilitate information exchange. We develop two specialized variants, one for object-centric and one for scene-level reconstruction, trained on comprehensive datasets.
arXiv Detail & Related papers (2024-12-12T18:52:53Z)
- ESVO2: Direct Visual-Inertial Odometry with Stereo Event Cameras [41.992980062962495]
Event-based visual odometry aims at solving the tracking and mapping subproblems (typically in parallel). We build an event-based stereo visual-inertial odometry system on top of a direct pipeline. The resulting system scales well with modern high-resolution event cameras.
arXiv Detail & Related papers (2024-10-12T05:35:27Z)
- CAPE: Camera View Position Embedding for Multi-View 3D Object Detection [100.02565745233247]
Current query-based methods rely on global 3D position embeddings to learn the geometric correspondence between images and 3D space.
We propose a novel method based on CAmera view Position Embedding, called CAPE.
CAPE achieves state-of-the-art performance (61.0% NDS and 52.5% mAP) among all LiDAR-free methods on the nuScenes dataset.
arXiv Detail & Related papers (2023-03-17T18:59:54Z)
- Category-Level Metric Scale Object Shape and Pose Estimation [73.92460712829188]
We propose a framework that jointly estimates metric-scale shape and pose from a single RGB image.
We validated our method on both synthetic and real-world datasets to evaluate category-level object pose and shape.
arXiv Detail & Related papers (2021-09-01T12:16:46Z)
- CoMo: A novel co-moving 3D camera system [0.0]
CoMo is a co-moving camera system of two synchronized high-speed cameras coupled with rotational stages.
We address the calibration of the external parameters, measuring the position of the cameras and their three angles of yaw, pitch, and roll in the system's "home" configuration (a sketch of how these angles compose into a rotation follows this entry).
We evaluate the robustness and accuracy of the system by comparing reconstructed and measured 3D distances in what we call 3D tests, which show a relative error on the order of 1%.
arXiv Detail & Related papers (2021-01-26T13:29:13Z)
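As background for the yaw, pitch, and roll angles CoMo calibrates, the sketch below composes them into a single rotation matrix under the common Z-Y-X convention; CoMo's exact angle conventions are not given here, so the axis order and signs are assumptions.

```python
import numpy as np

def rotation_from_ypr(yaw: float, pitch: float, roll: float) -> np.ndarray:
    """Compose a rotation from yaw (Z), pitch (Y), roll (X), in radians.

    Convention assumed: R = Rz(yaw) @ Ry(pitch) @ Rx(roll); other systems
    may order the axes differently.
    """
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

# Sanity check: zero angles give the identity rotation.
assert np.allclose(rotation_from_ypr(0.0, 0.0, 0.0), np.eye(3))
```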
This list is automatically generated from the titles and abstracts of the papers on this site.