Reconstructing People, Places, and Cameras
- URL: http://arxiv.org/abs/2412.17806v2
- Date: Tue, 20 May 2025 21:13:16 GMT
- Title: Reconstructing People, Places, and Cameras
- Authors: Lea Müller, Hongsuk Choi, Anthony Zhang, Brent Yi, Jitendra Malik, Angjoo Kanazawa,
- Abstract summary: "Humans and Structure from Motion" (HSfM) is a method for jointly reconstructing multiple human meshes, scene point clouds, and camera parameters in a metric world coordinate system.<n>Our results show that incorporating human data into the SfM pipeline improves camera pose estimation.
- Score: 57.81696692335401
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present "Humans and Structure from Motion" (HSfM), a method for jointly reconstructing multiple human meshes, scene point clouds, and camera parameters in a metric world coordinate system from a sparse set of uncalibrated multi-view images featuring people. Our approach combines data-driven scene reconstruction with the traditional Structure-from-Motion (SfM) framework to achieve more accurate scene reconstruction and camera estimation, while simultaneously recovering human meshes. In contrast to existing scene reconstruction and SfM methods that lack metric scale information, our method estimates approximate metric scale by leveraging a human statistical model. Furthermore, it reconstructs multiple human meshes within the same world coordinate system alongside the scene point cloud, effectively capturing spatial relationships among individuals and their positions in the environment. We initialize the reconstruction of humans, scenes, and cameras using robust foundational models and jointly optimize these elements. This joint optimization synergistically improves the accuracy of each component. We compare our method to existing approaches on two challenging benchmarks, EgoHumans and EgoExo4D, demonstrating significant improvements in human localization accuracy within the world coordinate frame (reducing error from 3.51m to 1.04m in EgoHumans and from 2.9m to 0.56m in EgoExo4D). Notably, our results show that incorporating human data into the SfM pipeline improves camera pose estimation (e.g., increasing RRA@15 by 20.3% on EgoHumans). Additionally, qualitative results show that our approach improves overall scene reconstruction quality. Our code is available at: https://github.com/hongsukchoi/HSfM_RELEASE
Related papers
- AHAP: Reconstructing Arbitrary Humans from Arbitrary Perspectives with Geometric Priors [81.50960055126156]
We present textbfAHAP, a feed-forward framework for reconstructing arbitrary humans from arbitrary camera perspectives.<n>Our core lies in the effective fusion of multi-view geometry to assist human association, reconstruction and localization.<n>A Human Head fuses cross-view features and scene context for SMPL prediction, guided by cross-view reprojection losses to enforce body pose consistency.
arXiv Detail & Related papers (2026-02-27T11:53:45Z) - UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass [83.7071371474926]
UniSH is a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction.<n>Our framework bridges strong, disparate priors from scene reconstruction and HMR.<n>Our model achieves state-of-the-art performance on human-centric scene reconstruction.
arXiv Detail & Related papers (2026-01-03T16:06:27Z) - Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space [9.795479102842622]
We introduce a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd.<n>Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images.<n>We also propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale.
arXiv Detail & Related papers (2025-11-17T12:00:13Z) - Human3R: Everyone Everywhere All at Once [69.16576238974876]
We present Human3R, a feed-forward framework for online 4D human-scene reconstruction from monocular videos.<n>Human3R is a unified model that eliminates heavy dependencies and iterative refinement.<n>It delivers superior performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation.
arXiv Detail & Related papers (2025-10-07T17:59:52Z) - HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers [60.86393841247567]
HumanRAM is a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images.<n>Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions.<n> Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets.
arXiv Detail & Related papers (2025-06-03T17:50:05Z) - ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos [18.73641648585445]
Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses.
We introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation and human-scene reconstruction in an online fashion.
Specifically, we design a human deformation module to reconstruct the details and enhance generalizability to out-of-distribution poses faithfully.
arXiv Detail & Related papers (2025-04-17T17:59:02Z) - Reconstructing Humans with a Biomechanically Accurate Skeleton [55.06027148976482]
We introduce a method for reconstructing 3D humans from a single image using a biomechanically accurate skeleton model.
Compared to state-of-the-art methods for 3D human mesh recovery, our model achieves competitive performance on standard benchmarks.
arXiv Detail & Related papers (2025-03-27T17:56:24Z) - Monocular Person Localization under Camera Ego-motion [5.030357146921396]
We consider person localization as a part of a pose estimation problem.<n>By representing a human with a four-point model, our method jointly estimates the 2D camera attitude and the person's 3D location.<n>Our method is further implemented into a person-following system and deployed on an agile quadruped robot.
arXiv Detail & Related papers (2025-03-04T11:07:27Z) - Crowd3D++: Robust Monocular Crowd Reconstruction with Upright Space [55.77397543011443]
This paper aims to reconstruct hundreds of people's 3D poses, shapes, and locations from a single image with unknown camera parameters.
Crowd3D is proposed to convert the complex 3D human localization into 2D-pixel localization with robust camera and ground estimation.
Crowd3D++ eliminates the influence of camera parameters and the cropping operation by the proposed canonical upright space and ground-aware normalization transform.
arXiv Detail & Related papers (2024-11-09T16:49:59Z) - WHAC: World-grounded Humans and Cameras [37.877565981937586]
We aim to recover expressive parametric human models (i.e., SMPL-X) and corresponding camera poses jointly.
We introduce a novel framework, referred to as WHAC, to facilitate world-grounded expressive human pose and shape estimation.
We present a new synthetic dataset, WHAC-A-Mole, which includes accurately annotated humans and cameras.
arXiv Detail & Related papers (2024-03-19T17:58:02Z) - MUC: Mixture of Uncalibrated Cameras for Robust 3D Human Body Reconstruction [12.942635715952525]
Multiple cameras can provide comprehensive multi-view video coverage of a person.
Previous studies have overlooked the challenges posed by self-occlusion under multiple views.
We introduce a method to reconstruct the 3D human body from multiple uncalibrated camera views.
arXiv Detail & Related papers (2024-03-08T05:03:25Z) - Scene-Aware 3D Multi-Human Motion Capture from a Single Camera [83.06768487435818]
We consider the problem of estimating the 3D position of multiple humans in a scene as well as their body shape and articulation from a single RGB video recorded with a static camera.
We leverage recent advances in computer vision using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks.
In particular, we estimate the scene depth and unique person scale from normalized disparity predictions using the 2D body joints and joint angles.
arXiv Detail & Related papers (2023-01-12T18:01:28Z) - Embodied Scene-aware Human Pose Estimation [25.094152307452]
We propose embodied scene-aware human pose estimation.
Our method is one stage, causal, and recovers global 3D human poses in a simulated environment.
arXiv Detail & Related papers (2022-06-18T03:50:19Z) - GLAMR: Global Occlusion-Aware Human Mesh Recovery with Dynamic Cameras [99.07219478953982]
We present an approach for 3D global human mesh recovery from monocular videos recorded with dynamic cameras.
We first propose a deep generative motion infiller, which autoregressively infills the body motions of occluded humans based on visible motions.
In contrast to prior work, our approach reconstructs human meshes in consistent global coordinates even with dynamic cameras.
arXiv Detail & Related papers (2021-12-02T18:59:54Z) - Deep3DPose: Realtime Reconstruction of Arbitrarily Posed Human Bodies
from Single RGB Images [5.775625085664381]
We introduce an approach that accurately reconstructs 3D human poses and detailed 3D full-body geometric models from single images in realtime.
Key idea of our approach is a novel end-to-end multi-task deep learning framework that uses single images to predict five outputs simultaneously.
We show the system advances the frontier of 3D human body and pose reconstruction from single images by quantitative evaluations and comparisons with state-of-the-art methods.
arXiv Detail & Related papers (2021-06-22T04:26:11Z) - THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers [67.8628917474705]
THUNDR is a transformer-based deep neural network methodology to reconstruct the 3d pose and shape of people.
We show state-of-the-art results on Human3.6M and 3DPW, for both the fully-supervised and the self-supervised models.
We observe very solid 3d reconstruction performance for difficult human poses collected in the wild.
arXiv Detail & Related papers (2021-06-17T09:09:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.