Human3R: Everyone Everywhere All at Once
- URL: http://arxiv.org/abs/2510.06219v1
- Date: Tue, 07 Oct 2025 17:59:52 GMT
- Title: Human3R: Everyone Everywhere All at Once
- Authors: Yue Chen, Xingyu Chen, Yuxuan Xue, Anpei Chen, Yuliang Xiu, Gerard Pons-Moll
- Abstract summary: We present Human3R, a feed-forward framework for online 4D human-scene reconstruction from monocular videos. Human3R is a unified model that eliminates heavy dependencies and iterative refinement. It delivers superior performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation.
- Score: 69.16576238974876
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction, in the world frame, from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies, e.g., human detection, depth estimation, and SLAM pre-processing, Human3R jointly recovers global multi-person SMPL-X bodies ("everyone"), dense 3D scene ("everywhere"), and camera trajectories in a single forward pass ("all-at-once"). Our method builds upon the 4D online reconstruction model CUT3R and uses parameter-efficient visual prompt tuning, striving to preserve CUT3R's rich spatiotemporal priors while enabling direct readout of multiple SMPL-X bodies. Human3R is a unified model that eliminates heavy dependencies and iterative refinement. After being trained on the relatively small-scale synthetic dataset BEDLAM for just one day on one GPU, it achieves superior performance with remarkable efficiency: it reconstructs multiple humans in a one-shot manner, along with 3D scenes, in one stage, at real-time speed (15 FPS) and with a low memory footprint (8 GB). Extensive experiments demonstrate that Human3R delivers state-of-the-art or competitive performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation, with a single unified model. We hope that Human3R will serve as a simple yet strong baseline and be easily extended for downstream applications. Code is available at https://fanegg.github.io/Human3R
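To make the single-pass design concrete, here is a minimal PyTorch sketch of the prompt-tuning idea, with a stand-in trunk in place of CUT3R; every name, shape, and interface below is an illustrative assumption, not the actual Human3R API.

```python
# Illustrative sketch only: a frozen reconstruction trunk (stand-in for CUT3R)
# plus learnable prompt tokens and a direct SMPL-X readout head.
import torch
import torch.nn as nn

class StandInBackbone(nn.Module):
    """Placeholder trunk: frame + prompts -> tokens, scene points, camera pose."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(3 * 16 * 16, dim)  # toy patch embedding

    def forward(self, frame, prompts):
        b = frame.shape[0]
        patches = frame.reshape(b, -1, 3 * 16 * 16)
        tokens = torch.cat([self.proj(patches), prompts.expand(b, -1, -1)], dim=1)
        points = torch.randn(b, 1000, 3)             # dummy scene pointmap
        cam_pose = torch.eye(4).expand(b, 4, 4)      # dummy camera pose
        return tokens, points, cam_pose

class Human3RSketch(nn.Module):
    def __init__(self, dim=64, n_prompts=8, smplx_dim=179):  # smplx_dim assumed
        super().__init__()
        self.backbone = StandInBackbone(dim)
        for p in self.backbone.parameters():
            p.requires_grad = False                  # prompt tuning: trunk frozen
        self.prompts = nn.Parameter(torch.zeros(1, n_prompts, dim))
        self.smplx_head = nn.Linear(dim, smplx_dim)  # "everyone"

    def forward(self, frame):
        tokens, points, cam_pose = self.backbone(frame, self.prompts)
        smplx = self.smplx_head(tokens[:, -self.prompts.shape[1]:])
        return smplx, points, cam_pose               # "everywhere", "all at once"

smplx, pts, pose = Human3RSketch()(torch.randn(1, 3, 64, 64))
```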
Related papers
- SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering [6.706168135661958]
State-of-the-art multi-view methods learn to fuse predictions across views by training on large annotated datasets. We propose SkelSplat, a novel framework for multi-view 3D human pose estimation based on differentiable Gaussian rendering.
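As a rough illustration of differentiable Gaussian rendering for pose, the sketch below splats 2D Gaussians at projected joint locations and backpropagates a heatmap loss to the 3D joints; the camera model and loss are simplified assumptions, not SkelSplat's formulation.

```python
# Simplified differentiable Gaussian rendering: splat Gaussians at projected
# joints, compare to target heatmaps, and let gradients flow to the 3D pose.
import torch

def render_heatmaps(joints_2d, H=64, W=64, sigma=2.0):
    ys = torch.arange(H, dtype=torch.float32).view(H, 1, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, W, 1)
    dx = xs - joints_2d[:, 0].view(1, 1, -1)
    dy = ys - joints_2d[:, 1].view(1, 1, -1)
    return torch.exp(-(dx**2 + dy**2) / (2 * sigma**2))   # (H, W, J)

K = torch.tensor([[50., 0., 32.], [0., 50., 32.], [0., 0., 1.]])  # toy intrinsics
joints_3d = (0.3 * torch.randn(17, 3) + torch.tensor([0., 0., 3.])).requires_grad_()
target = torch.rand(64, 64, 17)                            # observed-view heatmaps

proj = joints_3d @ K.T                                     # pinhole projection
joints_2d = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
loss = ((render_heatmaps(joints_2d) - target) ** 2).mean()
loss.backward()                                            # d(loss)/d(joints_3d)
```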
arXiv Detail & Related papers (2025-11-11T14:28:43Z)
- HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction [15.368018463074058]
HAMSt3R is an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated images. Our method incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments.
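A minimal sketch of the multi-head pattern described above, with a toy convolutional trunk standing in for the MASt3R encoder; layer sizes and head designs are assumptions.

```python
# Shared trunk, three task heads: person mask, DensePose-style (u, v), depth.
import torch
import torch.nn as nn

class MultiHeadSketch(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.encoder = nn.Conv2d(3, feat_dim, 3, padding=1)  # stand-in trunk
        self.seg_head = nn.Conv2d(feat_dim, 1, 1)    # person-mask logits
        self.uv_head = nn.Conv2d(feat_dim, 2, 1)     # dense surface (u, v) coords
        self.depth_head = nn.Conv2d(feat_dim, 1, 1)  # per-pixel depth

    def forward(self, image):
        f = self.encoder(image)
        return self.seg_head(f), self.uv_head(f), self.depth_head(f)

seg, uv, depth = MultiHeadSketch()(torch.randn(1, 3, 64, 64))
```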
arXiv Detail & Related papers (2025-08-22T14:43:18Z)
- HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers [60.86393841247567]
HumanRAM is a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets.
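The sketch below illustrates explicit pose conditioning in a transformer, with pose parameters embedded as an extra token alongside image tokens; dimensions and the token layout are assumptions, not HumanRAM's architecture.

```python
# Pose parameters embedded as an extra transformer token next to image tokens,
# so one decoder serves reconstruction (estimated pose) and animation (novel pose).
import torch
import torch.nn as nn

dim = 256
pose_embed = nn.Linear(72, dim)          # SMPL-style pose vector -> one token
layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=2)

img_tokens = torch.randn(1, 196, dim)    # from a hypothetical image encoder
pose_token = pose_embed(torch.randn(1, 72)).unsqueeze(1)
features = decoder(torch.cat([img_tokens, pose_token], dim=1))  # pose-conditioned
```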
arXiv Detail & Related papers (2025-06-03T17:50:05Z)
- ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos [18.73641648585445]
Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses. We introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation, and human-scene reconstruction in an online fashion. Specifically, we design a human deformation module to faithfully reconstruct details and enhance generalizability to out-of-distribution poses.
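As one plausible reading of a human deformation module, the sketch below predicts pose-conditioned per-vertex offsets on a canonical template; sizes and the conditioning scheme are assumptions.

```python
# A pose-conditioned deformation module: per-vertex offsets on a canonical
# template, conditioned on body pose. Sizes and inputs are assumptions.
import torch
import torch.nn as nn

deform = nn.Sequential(nn.Linear(3 + 72, 128), nn.ReLU(), nn.Linear(128, 3))
verts = torch.randn(6890, 3)                    # canonical template vertices
pose = torch.randn(72).expand(6890, 72)         # SMPL-style pose, broadcast
offsets = deform(torch.cat([verts, pose], dim=1))
detailed = verts + offsets                      # deformed, detail-carrying surface
```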
arXiv Detail & Related papers (2025-04-17T17:59:02Z)
- Reconstructing Humans with a Biomechanically Accurate Skeleton [55.06027148976482]
We introduce a method for reconstructing 3D humans from a single image using a biomechanically accurate skeleton model. Compared to state-of-the-art methods for 3D human mesh recovery, our model achieves competitive performance on standard benchmarks.
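A toy forward-kinematics sketch of what "biomechanically accurate" can mean in practice: joints rotate about anatomical axes within clamped ranges; the axes, offsets, and limits below are invented for illustration.

```python
# Toy kinematic chain with anatomically clamped hinge joints.
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def forward_kinematics(angles, offsets, limits):
    """Compose joint rotations root-to-tip, clamping each angle to its range."""
    pos, R, out = np.zeros(3), np.eye(3), []
    for a, off, (lo, hi) in zip(angles, offsets, limits):
        R = R @ rot_x(np.clip(a, lo, hi))   # hinge joint with biomechanical limit
        pos = pos + R @ off
        out.append(pos.copy())
    return np.stack(out)

# Hypothetical hip -> knee -> ankle chain; knee restricted to flexion only
print(forward_kinematics([0.3, 1.2, -0.1],
                         [np.array([0., -0.4, 0.])] * 3,
                         [(-1.0, 1.0), (0.0, 2.4), (-0.5, 0.5)]))
```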
arXiv Detail & Related papers (2025-03-27T17:56:24Z)
- Reconstructing People, Places, and Cameras [57.81696692335401]
"Humans and Structure from Motion" (HSfM) is a method for jointly reconstructing multiple human meshes, scene point clouds, and camera parameters in a metric world coordinate system.<n>Our results show that incorporating human data into the SfM pipeline improves camera pose estimation.
arXiv Detail & Related papers (2024-12-23T18:58:34Z)
- Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot [22.848563931757962]
We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image.
Predictions encompass the whole body, including hands and facial expressions, using the SMPL-X parametric model.
We show that incorporating an additional dataset of close-up hand images into the training data further enhances predictions, particularly for hands.
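A minimal sketch of the single-shot pattern: one backbone pass yields a person-center heatmap plus dense parameter maps, and parameters are read out at detected centers; all dimensions (including the SMPL-X parameter size) are assumptions.

```python
# One backbone pass, then read SMPL-X parameters out at person-center peaks.
import torch
import torch.nn as nn

feat = torch.randn(1, 256, 32, 32)          # hypothetical backbone features
center_head = nn.Conv2d(256, 1, 1)          # person-center scores
param_head = nn.Conv2d(256, 169, 1)         # per-location SMPL-X params (size assumed)

scores = center_head(feat).sigmoid()[0, 0]  # (32, 32)
centers = (scores > 0.5).nonzero()          # one spatial location per person
params = param_head(feat)[0, :, centers[:, 0], centers[:, 1]].T  # (N, 169)
```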
arXiv Detail & Related papers (2024-02-22T16:05:13Z)
- WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion [43.95997922499137]
WHAM (World-grounded Humans with Accurate Motion) reconstructs 3D human motion in a global coordinate system from video.
It uses camera angular velocity, estimated by a SLAM method, together with human motion to estimate the body's global trajectory.
WHAM outperforms all existing 3D human motion recovery methods across multiple in-the-wild benchmarks.
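The sketch below illustrates the trajectory idea as described: per-frame root velocities are rotated into the world frame using SLAM-derived camera orientations and then integrated; it is a simplified reading, not WHAM's code.

```python
# Rotate per-frame local root velocities into the world with SLAM camera
# orientations, then integrate.
import numpy as np

def integrate_trajectory(R_world_cam, v_local, dt=1 / 30):
    """R_world_cam: (T, 3, 3) camera-to-world rotations from SLAM.
    v_local: (T, 3) root velocities estimated from the human motion."""
    pos, traj = np.zeros(3), []
    for R, v in zip(R_world_cam, v_local):
        pos = pos + dt * (R @ v)    # local velocity expressed in the world frame
        traj.append(pos.copy())
    return np.stack(traj)
```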
arXiv Detail & Related papers (2023-12-12T18:57:46Z)
- Decoupling Human and Camera Motion from Videos in the Wild [67.39432972193929]
We propose a method to reconstruct global human trajectories from videos in the wild.
Our method decouples the camera and human motion, which allows us to place people in the same world coordinate frame.
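A minimal sketch of the decoupling: composing a per-frame camera-to-world transform with the human pose estimated in camera coordinates places everyone in one shared world frame.

```python
# Compose camera-to-world with the camera-frame pose to land in a world frame.
import numpy as np

def to_world(R_wc, t_wc, joints_cam):
    """R_wc, t_wc: camera-to-world rotation/translation for one frame;
    joints_cam: (J, 3) joints estimated in that camera's coordinates."""
    return joints_cam @ R_wc.T + t_wc

world_joints = to_world(np.eye(3), np.array([0., 0., 2.]), np.zeros((17, 3)))
# With camera motion factored out, what remains is the person's own trajectory.
```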
arXiv Detail & Related papers (2023-02-24T18:59:15Z)
- 3D Segmentation of Humans in Point Clouds with Synthetic Data [21.518379214837278]
We propose the task of joint 3D human semantic segmentation, instance segmentation and multi-human body-part segmentation.
We propose a framework for generating training data of synthetic humans interacting with real 3D scenes.
We also propose a novel transformer-based model, Human3D, which is the first end-to-end model for segmenting multiple human instances and their body parts.
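The sketch below shows a generic query-based mask-transformer pattern consistent with this description: learned queries emit instance masks while a per-point head labels body parts; it is assumed, not Human3D's exact design.

```python
# Learned queries produce instance masks; a per-point head labels body parts.
import torch
import torch.nn as nn

n_queries, dim, n_points, n_parts = 8, 128, 2048, 15
queries = nn.Parameter(torch.randn(n_queries, dim))
point_feats = torch.randn(n_points, dim)           # from a hypothetical 3D backbone
part_head = nn.Linear(dim, n_parts)

inst_masks = (queries @ point_feats.T).sigmoid()   # (queries, points) instances
part_logits = part_head(point_feats)               # (points, parts) body parts
# Intersecting a query's mask with the part labels yields that person's parts.
```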
arXiv Detail & Related papers (2022-12-01T18:59:21Z)
- Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-localization in Large Scenes from Body-Mounted Sensors [71.29186299435423]
We introduce the Human POSEitioning System (HPS), a method to recover the full 3D pose of a human registered with a 3D scan of the surrounding environment.
We show that our optimization-based integration of IMU-based body tracking and camera-based self-localization exploits the benefits of both, resulting in pose accuracy free of drift.
HPS could be used for VR/AR applications where humans interact with the scene without requiring direct line of sight with an external camera.
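As a rough illustration of the optimization-based integration, the sketch below combines a smooth but drifting IMU term with sparse, drift-free camera localizations against the scene scan; residual forms and weights are illustrative assumptions.

```python
# IMU term (dense, smooth, drifting) + localization term (sparse, drift-free).
import numpy as np

def hps_cost(traj, imu_deltas, loc_idx, loc_pos, w_imu=1.0, w_loc=10.0):
    """traj: (T, 3) root trajectory being optimized;
    imu_deltas: (T-1, 3) frame-to-frame translations from IMU tracking;
    loc_idx, loc_pos: frames and positions of camera localizations in the scan."""
    e_imu = np.diff(traj, axis=0) - imu_deltas   # keeps motion IMU-consistent
    e_loc = traj[loc_idx] - loc_pos              # pins trajectory to the scene
    return w_imu * (e_imu ** 2).sum() + w_loc * (e_loc ** 2).sum()
```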
arXiv Detail & Related papers (2021-03-31T17:58:31Z)