Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space
- URL: http://arxiv.org/abs/2511.13282v2
- Date: Thu, 20 Nov 2025 08:51:24 GMT
- Title: Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space
- Authors: Kaiwen Wang, Kaili Zheng, Yiming Shi, Chenyi Guo, Ji Wu
- Abstract summary: We introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images. We also propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale.
- Score: 9.795479102842622
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric, where each human is processed individually without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled Maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code is available at: https://github.com/gouba2333/MA-HMR.
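The abstract describes DTO as a MAP problem over per-person camera-space translations, combining an anthropometric height prior with depth cues from a monocular depth estimator. The sketch below is a hypothetical illustration of what such an objective could look like; it is not the authors' released implementation, and all names, weights, and priors (e.g., the 1.7 m mean height and the noise scales) are illustrative assumptions.

```python
import torch

def dto_map_objective(trans, bbox_heights_px, root_uv, depth_obs, focal,
                      mu_h=1.7, sigma_h=0.1, sigma_d=0.5, w_proj=1e-3):
    """Negative log-posterior over per-person camera-space translations (N, 3)."""
    z = trans[:, 2].clamp(min=0.1)  # depths in metres, kept positive

    # Pinhole projection of each person's root joint (principal point assumed
    # at the image origin for brevity); should match the detected 2D root.
    proj_u = focal * trans[:, 0] / z
    proj_v = focal * trans[:, 1] / z
    reproj = (proj_u - root_uv[:, 0]) ** 2 + (proj_v - root_uv[:, 1]) ** 2

    # Anthropometric prior: the metric height implied by the pixel height and
    # depth (h = pixels * z / f) should stay near a population mean.
    implied_h = bbox_heights_px * z / focal
    height_prior = ((implied_h - mu_h) / sigma_h) ** 2

    # Scene-consistency term: agree with the monocular depth estimate at each
    # person's root pixel, so all people share one consistent depth ordering.
    depth_term = ((z - depth_obs) / sigma_d) ** 2

    return (w_proj * reproj + height_prior + depth_term).sum()


# A joint refinement over all N people in one image could then be run as:
# trans = trans_init.clone().requires_grad_(True)   # (N, 3) initial estimates
# opt = torch.optim.Adam([trans], lr=1e-2)
# for _ in range(500):
#     opt.zero_grad()
#     loss = dto_map_objective(trans, bbox_h, root_uv, depth_obs, focal)
#     loss.backward()
#     opt.step()
```

Because every person's translation enters the same objective, the optimization resolves conflicting depths and scales across the crowd rather than per subject, which is the scene-level consistency the abstract emphasizes.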
Related papers
- AHAP: Reconstructing Arbitrary Humans from Arbitrary Perspectives with Geometric Priors [81.50960055126156]
We present AHAP, a feed-forward framework for reconstructing arbitrary humans from arbitrary camera perspectives. Our core contribution lies in the effective fusion of multi-view geometry to assist human association, reconstruction, and localization. A Human Head fuses cross-view features and scene context for SMPL prediction, guided by cross-view reprojection losses to enforce body pose consistency.
arXiv Detail & Related papers (2026-02-27T11:53:45Z) - UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass [83.7071371474926]
UniSH is a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. Our framework bridges strong, disparate priors from scene reconstruction and HMR. Our model achieves state-of-the-art performance on human-centric scene reconstruction.
arXiv Detail & Related papers (2026-01-03T16:06:27Z) - Human3R: Everyone Everywhere All at Once [69.16576238974876]
We present Human3R, a feed-forward framework for online 4D human-scene reconstruction from monocular videos. Human3R is a unified model that eliminates heavy dependencies and iterative refinement. It delivers superior performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation.
arXiv Detail & Related papers (2025-10-07T17:59:52Z) - HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction [15.368018463074058]
HAMSt3R is an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated images. Our method incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments.
arXiv Detail & Related papers (2025-08-22T14:43:18Z) - Adept: Annotation-Denoising Auxiliary Tasks with Discrete Cosine Transform Map and Keypoint for Human-Centric Pretraining [12.950323493528508]
This paper improves the data scalability of human-centric pretraining methods. We explore the semantic information of RGB images in the frequency space via the Discrete Cosine Transform (DCT). We also propose new annotation-denoising auxiliary tasks with keypoints and DCT maps to strengthen the RGB image extractor.
arXiv Detail & Related papers (2025-04-29T14:14:29Z) - Reconstructing People, Places, and Cameras [57.81696692335401]
"Humans and Structure from Motion" (HSfM) is a method for jointly reconstructing multiple human meshes, scene point clouds, and camera parameters in a metric world coordinate system.<n>Our results show that incorporating human data into the SfM pipeline improves camera pose estimation.
arXiv Detail & Related papers (2024-12-23T18:58:34Z) - HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors [47.62426718293504]
HumanSplat predicts the 3D Gaussian Splatting properties of any human from a single input image.
HumanSplat surpasses existing state-of-the-art methods in achieving photorealistic novel-view synthesis.
arXiv Detail & Related papers (2024-06-18T10:05:33Z) - Diffusion Models are Efficient Data Generators for Human Mesh Recovery [55.37787289869703]
We show that synthetic data created by generative models is complementary to CG-rendered data. We propose an effective data generation pipeline based on recent diffusion models, termed HumanWild. Our work could pave the way for scaling up 3D human recovery to in-the-wild scenes.
arXiv Detail & Related papers (2024-03-17T06:31:16Z) - Zolly: Zoom Focal Length Correctly for Perspective-Distorted Human Mesh
Reconstruction [66.10717041384625]
Zolly is the first 3DHMR method focusing on perspective-distorted images.
We propose a new camera model and a novel 2D representation, termed distortion image, which describes the 2D dense distortion scale of the human body.
We extend two real-world datasets tailored for this task, both containing perspective-distorted human images.
arXiv Detail & Related papers (2023-03-24T04:22:41Z)