ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos
- URL: http://arxiv.org/abs/2504.13167v2
- Date: Fri, 18 Apr 2025 17:00:33 GMT
- Authors: Zetong Zhang, Manuel Kaufmann, Lixin Xue, Jie Song, Martin R. Oswald
- Abstract summary: Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses. We introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation, and human-scene reconstruction in an online fashion. Specifically, we design a human deformation module to faithfully reconstruct details and enhance generalizability to out-of-distribution poses.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Photorealistic reconstruction of a scene and the humans in it from a single monocular in-the-wild video is central to the perception of a human-centric 3D world. Recent neural rendering advances have enabled holistic human-scene reconstruction, but require pre-calibrated camera and human poses and days of training time. In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation, and human-scene reconstruction in an online fashion. We use 3D Gaussian Splatting to efficiently learn Gaussian primitives for humans and scenes, and design reconstruction-based camera tracking and human pose estimation modules that enable holistic understanding and effective disentanglement of pose and appearance. Specifically, we design a human deformation module to faithfully reconstruct details and generalize to out-of-distribution poses. To accurately learn the spatial correlation between human and scene, we introduce occlusion-aware human silhouette rendering and monocular geometric priors, which further improve reconstruction quality. Experiments on the EMDB and NeuMan datasets demonstrate performance superior or on par with existing methods in camera tracking, human pose estimation, novel view synthesis, and runtime. Our project page is at https://eth-ait.github.io/ODHSR.
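The pipeline the abstract describes alternates tracking and mapping each frame: the camera pose (and, analogously, the human pose) is refined against renderings of the current Gaussian map before the map itself is updated. Below is a minimal, hypothetical sketch of that tracking step in PyTorch, not the paper's implementation: the "renderer" only projects Gaussian centers, and the photometric term is a toy proxy.

```python
import torch

def hat(k):
    # Skew-symmetric matrix of a 3-vector, built with stack to stay autograd-safe.
    z = torch.zeros((), dtype=k.dtype)
    return torch.stack([torch.stack([z, -k[2], k[1]]),
                        torch.stack([k[2], z, -k[0]]),
                        torch.stack([-k[1], k[0], z])])

def rodrigues(w):
    # Axis-angle vector -> rotation matrix (Rodrigues' formula).
    theta = torch.sqrt((w * w).sum() + 1e-12)
    K = hat(w / theta)
    return torch.eye(3) + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)

def render_proxy(means, w, t, intrinsics):
    # Toy stand-in for Gaussian splatting: project Gaussian centers only.
    cam = means @ rodrigues(w).T + t
    uv = cam @ intrinsics.T
    return uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)

intrinsics = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
means = torch.randn(500, 3) + torch.tensor([0., 0., 5.])  # current scene map

# Synthetic "new frame": the map observed under a small, unknown camera motion.
w_true = torch.tensor([0.02, -0.01, 0.00])
t_true = torch.tensor([0.05, 0.00, 0.00])
observed = render_proxy(means, w_true, t_true, intrinsics)

# Tracking: refine the camera pose against the rendered map. A mapping step
# would then hold the pose fixed and optimize the Gaussians instead.
w = torch.zeros(3, requires_grad=True)
t = torch.zeros(3, requires_grad=True)
opt = torch.optim.Adam([w, t], lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    loss = ((render_proxy(means, w, t, intrinsics) - observed) ** 2).mean()
    loss.backward()
    opt.step()
print(w.detach(), t.detach())  # should approach w_true, t_true
```

In the full system this proxy would be replaced by differentiable Gaussian rasterization with photometric, silhouette, and monocular-geometry terms, per the abstract.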
Related papers
- Realistic Clothed Human and Object Joint Reconstruction from a Single Image
We introduce a novel implicit approach for jointly reconstructing realistic 3D clothed humans and objects from a monocular view.
For the first time, we model both the human and the object with an implicit representation, allowing us to capture more realistic details such as clothing.
arXiv Detail & Related papers (2025-02-25T12:26:36Z)
- WonderHuman: Hallucinating Unseen Parts in Dynamic 3D Human Reconstruction
We present WonderHuman to reconstruct dynamic human avatars from a monocular video for high-fidelity novel view synthesis.
Our method achieves SOTA performance in producing photorealistic renderings from the given monocular video.
arXiv Detail & Related papers (2025-02-03T04:43:41Z)
- Joint Optimization for 4D Human-Scene Reconstruction in the Wild
We propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos.
Experimental results show JOSH achieves better results on both global human motion estimation and dense scene reconstruction.
We further design a more efficient model, JOSH3R, and train it directly with pseudo-labels from web videos.
arXiv Detail & Related papers (2025-01-04T01:53:51Z)
- HINT: Learning Complete Human Neural Representations from Limited Viewpoints
We propose a NeRF-based algorithm able to learn a detailed and complete human model from limited viewing angles.
As a result, our method can reconstruct complete humans even from a few viewing angles, improving PSNR by more than 15%.
arXiv Detail & Related papers (2024-05-30T05:43:09Z)
- SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion
SiTH is a novel pipeline that integrates an image-conditioned diffusion model into a 3D mesh reconstruction workflow.
We employ a powerful generative diffusion model to hallucinate the unseen back-view appearance based on the input images.
For the mesh reconstruction stage, we leverage skinned body meshes as guidance to recover full-body textured meshes from the input and back-view images.
arXiv Detail & Related papers (2023-11-27T14:22:07Z)
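As a control-flow sketch only: the SiTH summary above describes two stages, back-view hallucination with an image-conditioned diffusion model, then mesh recovery guidedded by a skinned body mesh. The stand-ins below (a mirror flip for the diffusion model, a normal-direction offset for the reconstruction) are hypothetical placeholders for the real components.

```python
import numpy as np

def hallucinate_back_view(front_rgb):
    # Placeholder for the image-conditioned diffusion model: mirror the front.
    return front_rgb[:, ::-1].copy()

def recover_textured_mesh(front_rgb, back_rgb, body_verts, body_normals):
    # Placeholder for guided reconstruction: displace the skinned body mesh
    # along its normals by per-vertex amounts "read" from the two views.
    cues = np.linspace(front_rgb.mean(), back_rgb.mean(), len(body_verts)) / 255.0
    return body_verts + 0.01 * cues[:, None] * body_normals

front = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
back = hallucinate_back_view(front)                           # stage 1
verts = np.random.randn(100, 3)                               # skinned body mesh
normals = verts / np.linalg.norm(verts, axis=1, keepdims=True)
clothed = recover_textured_mesh(front, back, verts, normals)  # stage 2
print(clothed.shape)  # (100, 3)
```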
- Animatable 3D Gaussian: Fast and High-Quality Reconstruction of Multiple Human Avatars
We propose Animatable 3D Gaussian, which learns human avatars from input images and poses.
On both novel view synthesis and novel pose synthesis tasks, our method achieves higher reconstruction quality than InstantAvatar with less training time.
Our method can be easily extended to multi-human scenes and achieves comparable novel view synthesis results on a scene with ten people in only 25 seconds of training.
arXiv Detail & Related papers (2023-11-27T08:17:09Z)
- Humans in 4D: Reconstructing and Tracking Humans with Transformers
We present an approach to reconstruct humans and track them over time.
At the core of our approach, we propose a fully "transformerized" version of a network for human mesh recovery.
This network, HMR 2.0, advances the state of the art and shows the capability to analyze unusual poses that have in the past been difficult to reconstruct from single images.
arXiv Detail & Related papers (2023-05-31T17:59:52Z)
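A minimal sketch of the fully "transformerized" mesh-recovery pattern attributed to HMR 2.0: ViT-style patch tokens plus a learned query token pass through a transformer encoder, and the query regresses SMPL-like pose/shape parameters. Layer counts, dimensions, and the 6D-rotation output are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyHMR(nn.Module):
    def __init__(self, dim=256, patches=196, n_pose=24 * 6, n_shape=10):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.query = nn.Parameter(torch.zeros(1, 1, dim))        # learned token
        self.pos = nn.Parameter(torch.zeros(1, patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, n_pose + n_shape)  # 6D rot per joint + betas

    def forward(self, img):                                   # img: (B, 3, 224, 224)
        tok = self.patch_embed(img).flatten(2).transpose(1, 2)  # (B, 196, dim)
        q = self.query.expand(tok.shape[0], -1, -1)
        x = self.encoder(torch.cat([q, tok], dim=1) + self.pos)
        return self.head(x[:, 0])                               # (B, 24*6 + 10)

out = TinyHMR()(torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 154])
```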
- Compositional 3D Human-Object Neural Animation
Human-object interactions (HOIs) are crucial for human-centric scene understanding applications such as human-centric visual generation, AR/VR, and robotics.
In this paper, we address this challenge in HOI animation from a compositional perspective.
We adopt neural human-object deformation to model and render HOI dynamics based on implicit neural representations.
arXiv Detail & Related papers (2023-04-27T10:04:56Z)
- Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition
We present Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos.
Our method does not require any ground-truth supervision or priors extracted from large datasets of clothed human scans.
It solves the tasks of scene decomposition and surface reconstruction directly in 3D by modeling both the human and the background in the scene jointly.
arXiv Detail & Related papers (2023-02-22T18:59:17Z)
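The Vid2Avatar entry says decomposition is solved directly in 3D by modeling the human and the background jointly. One concrete reading (an assumption here, with hand-made densities in place of the paper's neural fields) is joint alpha-compositing of two fields along each ray, so the human alpha mask falls out of which field carries the density:

```python
import numpy as np

t = np.linspace(0.0, 10.0, 128)                  # sample depths along one ray
dt = t[1] - t[0]
sigma_h = 5.0 * np.exp(-0.5 * ((t - 3.0) / 0.2) ** 2)  # "human" density
sigma_b = 5.0 * np.exp(-0.5 * ((t - 8.0) / 0.5) ** 2)  # "background" density
c_h, c_b = np.array([0.8, 0.2, 0.2]), np.array([0.2, 0.2, 0.8])

sigma = sigma_h + sigma_b
alpha = 1.0 - np.exp(-sigma * dt)                # per-sample opacity
T = np.cumprod(np.concatenate([[1.0], (1.0 - alpha)[:-1]]))  # transmittance
w = T * alpha                                    # compositing weights
# Per-sample color: density-weighted mix of the two fields' colors.
c = (sigma_h[:, None] * c_h + sigma_b[:, None] * c_b) / np.maximum(sigma, 1e-9)[:, None]
rgb = (w[:, None] * c).sum(axis=0)               # composited pixel color
human_alpha = (w * sigma_h / np.maximum(sigma, 1e-9)).sum()  # soft human mask
print(rgb, human_alpha)  # human field occludes background along this ray
```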
- Animatable Neural Radiance Fields from Monocular RGB Video
We present animatable neural radiance fields for detailed human avatar creation from monocular videos.
Our approach extends neural radiance fields to dynamic scenes with human movement by introducing explicit pose-guided deformation.
In experiments we show that the proposed approach achieves 1) implicit human geometry and appearance reconstruction with high-quality details, 2) photo-realistic rendering of the human from arbitrary views, and 3) animation of the human with arbitrary poses.
arXiv Detail & Related papers (2021-06-25T13:32:23Z)
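The entry above attributes animation to explicit pose-guided deformation. A common mechanism for this, assumed here rather than taken from the paper, is inverse linear blend skinning: a query point in posed space is warped back to canonical space with the blended inverse bone transform before the radiance field is queried. A two-bone toy example:

```python
import numpy as np

def lbs_inverse(x_posed, bone_transforms, weights):
    """Warp a posed-space point back to canonical space.
    bone_transforms: (J, 4, 4) canonical->posed per-bone transforms.
    weights: (J,) skinning weights for this point, summing to 1."""
    blended = np.tensordot(weights, bone_transforms, axes=1)  # (4, 4)
    x_h = np.append(x_posed, 1.0)                             # homogeneous coords
    return (np.linalg.inv(blended) @ x_h)[:3]

# Two bones: identity and a 90-degree rotation about the z-axis.
Rz = np.array([[0., -1., 0., 0.], [1., 0., 0., 0.],
               [0., 0., 1., 0.], [0., 0., 0., 1.]])
bones = np.stack([np.eye(4), Rz])
print(lbs_inverse(np.array([0., 1., 0.]), bones, np.array([0.0, 1.0])))
# -> approximately [1. 0. 0.]: the canonical point the rotated bone carried to (0, 1, 0).
```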
- Holistic 3D Human and Scene Mesh Estimation from Single View Images
We propose an end-to-end trainable model that perceives the 3D scene from a single RGB image.
We show that our model outperforms existing human body mesh methods and indoor scene reconstruction methods.
arXiv Detail & Related papers (2020-12-02T23:22:03Z)