SHARE: Scene-Human Aligned Reconstruction
- URL: http://arxiv.org/abs/2510.15342v1
- Date: Fri, 17 Oct 2025 06:12:10 GMT
- Title: SHARE: Scene-Human Aligned Reconstruction
- Authors: Joshua Li, Brendan Chharawala, Chang Shu, Xue Bin Peng, Pengcheng Xi
- Abstract summary: We introduce Scene-Human Aligned REconstruction, a technique that leverages the scene geometry's inherent spatial cues to accurately ground human motion reconstruction. It iteratively refines the human's positions at these keyframes by comparing the human mesh against a human point map extracted from the scene using the mask. Our approach enables more accurate 3D human placement while reconstructing the surrounding scene, facilitating use cases on both curated datasets and in-the-wild web videos.
- Score: 10.764401463569442
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Animating realistic character interactions with the surrounding environment is important for autonomous agents in gaming, AR/VR, and robotics. However, current methods for human motion reconstruction struggle with accurately placing humans in 3D space. We introduce Scene-Human Aligned REconstruction (SHARE), a technique that leverages the scene geometry's inherent spatial cues to accurately ground human motion reconstruction. Each reconstruction relies solely on a monocular RGB video from a stationary camera. SHARE first estimates a human mesh and segmentation mask for every frame, alongside a scene point map at keyframes. It iteratively refines the human's positions at these keyframes by comparing the human mesh against the human point map extracted from the scene using the mask. Crucially, we also ensure that non-keyframe human meshes remain consistent by preserving their relative root joint positions to keyframe root joints during optimization. Our approach enables more accurate 3D human placement while reconstructing the surrounding scene, facilitating use cases on both curated datasets and in-the-wild web videos. Extensive experiments demonstrate that SHARE outperforms existing methods.
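The consistency step described in the abstract (non-keyframe meshes keep their root joint positions relative to keyframe roots) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the plain-tuple representation of root positions, and the choice to tie each frame to its nearest keyframe in time are all assumptions made for illustration, and the refined keyframe roots are supplied by hand rather than produced by any mesh-to-point-map optimization.

```python
from typing import Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def nearest_keyframe(frame: int, keyframes: Sequence[int]) -> int:
    """Return the keyframe closest in time to `frame` (ties go to the earlier one).

    Assumption: the abstract does not say which keyframe a frame is tied to.
    """
    return min(keyframes, key=lambda k: (abs(k - frame), k))


def propagate_keyframe_updates(
    roots: List[Vec3],
    refined_keyframe_roots: Dict[int, Vec3],
) -> List[Vec3]:
    """Move every frame so its root offset to its keyframe root is unchanged."""
    keyframes = sorted(refined_keyframe_roots)
    updated: List[Vec3] = []
    for t, root in enumerate(roots):
        k = nearest_keyframe(t, keyframes)
        # Offset of this frame's root from the keyframe root before refinement.
        offset = tuple(r - kr for r, kr in zip(root, roots[k]))
        # Re-attach the same offset to the refined keyframe root.
        new_root = tuple(nr + o for nr, o in zip(refined_keyframe_roots[k], offset))
        updated.append(new_root)
    return updated


# Hypothetical data: four frames, with frames 0 and 3 acting as keyframes.
roots = [(0.0, 0.0, 2.0), (0.1, 0.0, 2.0), (0.2, 0.0, 2.1), (0.3, 0.0, 2.1)]
refined = {0: (0.0, 0.0, 3.0), 3: (0.3, 0.0, 3.5)}
new_roots = propagate_keyframe_updates(roots, refined)
```

In this toy example the keyframes land exactly on their refined positions, and frames 1 and 2 follow whichever keyframe is nearer, so the per-frame motion between a frame and its keyframe is preserved while the overall depth changes.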
Related papers
- CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives [65.89192712575797]
We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Our approach reduces motion tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks. This demonstrates CRISP's ability to generate physically-valid human motion and interaction environments at scale.
arXiv Detail & Related papers (2025-12-16T18:59:50Z)
- HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction [15.368018463074058]
HAMSt3R is an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated images. Our method incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments.
arXiv Detail & Related papers (2025-08-22T14:43:18Z)
- Joint Optimization for 4D Human-Scene Reconstruction in the Wild [59.322951972876716]
We propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. Experiment results show JOSH achieves better results on both global human motion estimation and dense scene reconstruction. We further design a more efficient model, JOSH3R, and directly train it with pseudo-labels from web videos.
arXiv Detail & Related papers (2025-01-04T01:53:51Z)
- Visibility Aware Human-Object Interaction Tracking from Single RGB Camera [40.817960406002506]
We propose a novel method to track the 3D human, object, contacts between them, and their relative translation across frames from a single RGB camera.
We condition our neural field reconstructions for human and object on per-frame SMPL model estimates obtained by pre-fitting SMPL to a video sequence.
Human and object motion from visible frames provides valuable information to infer the occluded object.
arXiv Detail & Related papers (2023-03-29T06:23:44Z)
- Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition [40.46674919612935]
We present Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos.
Our method does not require any groundtruth supervision or priors extracted from large datasets of clothed human scans.
It solves the tasks of scene decomposition and surface reconstruction directly in 3D by modeling both the human and the background in the scene jointly.
arXiv Detail & Related papers (2023-02-22T18:59:17Z)
- Scene-Aware 3D Multi-Human Motion Capture from a Single Camera [83.06768487435818]
We consider the problem of estimating the 3D position of multiple humans in a scene as well as their body shape and articulation from a single RGB video recorded with a static camera.
We leverage recent advances in computer vision using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks.
In particular, we estimate the scene depth and unique person scale from normalized disparity predictions using the 2D body joints and joint angles.
arXiv Detail & Related papers (2023-01-12T18:01:28Z)
- Human-Aware Object Placement for Visual Environment Reconstruction [63.14733166375534]
We show that human-scene interactions can be leveraged to improve the 3D reconstruction of a scene from a monocular RGB video.
Our key idea is that, as a person moves through a scene and interacts with it, we accumulate HSIs across multiple input images.
We show that our scene reconstruction can be used to refine the initial 3D human pose and shape estimation.
arXiv Detail & Related papers (2022-03-07T18:59:02Z)
- PLACE: Proximity Learning of Articulation and Contact in 3D Environments [70.50782687884839]
We propose a novel interaction generation method, named PLACE, which explicitly models the proximity between the human body and the 3D scene around it.
Our perceptual study shows that PLACE significantly improves the state-of-the-art method, approaching the realism of real human-scene interaction.
arXiv Detail & Related papers (2020-08-12T21:00:10Z)
- Long-term Human Motion Prediction with Scene Context [60.096118270451974]
We propose a novel three-stage framework for predicting human motion.
Our method first samples multiple human motion goals, then plans 3D human paths towards each goal, and finally predicts 3D human pose sequences following each path.
arXiv Detail & Related papers (2020-07-07T17:59:53Z)