Joint Optimization for 4D Human-Scene Reconstruction in the Wild
- URL: http://arxiv.org/abs/2501.02158v1
- Date: Sat, 04 Jan 2025 01:53:51 GMT
- Title: Joint Optimization for 4D Human-Scene Reconstruction in the Wild
- Authors: Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou
- Abstract summary: We propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos.
Experiment results show JOSH achieves better results on both global human motion estimation and dense scene reconstruction.
We further design a more efficient model, JOSH3R, and directly train it with pseudo-labels from web videos.
- Score: 59.322951972876716
- Abstract: Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, those prior methods can hardly reconstruct the natural and diverse human motion and scene context from web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. JOSH uses techniques in both dense scene reconstruction and human mesh recovery as initialization, and then it leverages the human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experiment results show JOSH achieves better results on both global human motion estimation and dense scene reconstruction by joint optimization of scene geometry and human motion. We further design a more efficient model, JOSH3R, and directly train it with pseudo-labels from web videos. JOSH3R outperforms other optimization-free methods by only training with labels predicted from JOSH, further demonstrating its accuracy and generalization ability.
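The idea of using human-scene contact to jointly refine the scene and the human can be illustrated with a toy sketch. This is not JOSH's actual objective, which also involves camera poses and full body motion. It assumes only two unknowns, a global scene scale `s` and a human translation offset `t`, and solves a linear least-squares problem that pulls contact joints onto the scene points they touch. The function name `fit_scale_and_offset` and the synthetic data are hypothetical and exist only for illustration.

```python
import numpy as np


def fit_scale_and_offset(scene_pts, contact_joints):
    """Jointly solve for a scene scale s and a human offset t that minimize
    sum_i || s * scene_pts[i] - (contact_joints[i] + t) ||^2.

    Toy stand-in for a contact constraint; not the paper's formulation.
    """
    n = scene_pts.shape[0]
    # Unknowns x = [s, t_x, t_y, t_z]; each contact gives 3 linear equations:
    #   s * p_ik - t_k = j_ik
    A = np.zeros((3 * n, 4))
    A[:, 0] = scene_pts.reshape(-1)            # coefficients of s
    A[:, 1:] = -np.tile(np.eye(3), (n, 1))     # coefficients of t
    b = contact_joints.reshape(-1)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[0], x[1:]


# Synthetic example (all values are made up for illustration).
rng = np.random.default_rng(0)
true_scale = 2.0
true_offset = np.array([0.1, 0.0, -0.2])
metric_contacts = rng.uniform(-1.0, 1.0, size=(8, 3))

# The scene reconstruction is assumed correct up to an unknown global scale.
scene_pts = metric_contacts / true_scale
# The initial human estimate is assumed off by a constant translation.
contact_joints = metric_contacts - true_offset

s_hat, t_hat = fit_scale_and_offset(scene_pts, contact_joints)
print("estimated scale:", s_hat)
print("estimated offset:", t_hat)
```

In this noise-free setting the contacts alone determine both unknowns, which is the intuition behind coupling the two reconstructions: neither the scale ambiguity of the scene nor the global placement of the human is resolved by its own initialization, but the contact constraint ties them together.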
Related papers
- ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation [17.438484695828276]
We present ZeroHSI, a novel approach that enables zero-shot 4D human-scene interaction synthesis by integrating video generation and neural human rendering.
Our key insight is to leverage the rich motion priors learned by state-of-the-art video generation models, which have been trained on vast amounts of natural human movements and interactions, and use differentiable rendering to reconstruct human-scene interactions.
We evaluate ZeroHSI on a curated dataset of various indoor and outdoor scenes with different interaction prompts, demonstrating its ability to generate diverse and contextually appropriate human-scene interactions.
arXiv Detail & Related papers (2024-12-24T18:55:38Z)
- PACE: Human and Camera Motion Estimation from in-the-wild Videos [113.76041632912577]
We present a method to estimate human motion in a global scene from moving cameras.
This is a highly challenging task due to the coupling of human and camera motions in the video.
We propose a joint optimization framework that disentangles human and camera motions using both foreground human motion priors and background scene features.
arXiv Detail & Related papers (2023-10-20T19:04:14Z)
- Decoupling Human and Camera Motion from Videos in the Wild [67.39432972193929]
We propose a method to reconstruct global human trajectories from videos in the wild.
Our method decouples the camera and human motion, which allows us to place people in the same world coordinate frame.
arXiv Detail & Related papers (2023-02-24T18:59:15Z)
- Human-Aware Object Placement for Visual Environment Reconstruction [63.14733166375534]
We show that human-scene interactions can be leveraged to improve the 3D reconstruction of a scene from a monocular RGB video.
Our key idea is that, as a person moves through a scene and interacts with it, we accumulate human-scene interactions (HSIs) across multiple input images.
We show that our scene reconstruction can be used to refine the initial 3D human pose and shape estimation.
arXiv Detail & Related papers (2022-03-07T18:59:02Z)
- Learning Motion Priors for 4D Human Body Capture in 3D Scenes [81.54377747405812]
We propose LEMO: LEarning human MOtion priors for 4D human body capture.
We introduce a novel motion prior, which reduces the jitter exhibited by poses recovered over a sequence.
We also design a contact friction term and a contact-aware motion infiller obtained via per-instance self-supervised training.
With our pipeline, we demonstrate high-quality 4D human body capture, reconstructing smooth motions and physically plausible body-scene interactions.
arXiv Detail & Related papers (2021-08-23T20:47:09Z)
- Synthesizing Long-Term 3D Human Motion and Interaction in 3D Scenes [27.443701512923177]
We propose to bridge human motion synthesis and scene affordance reasoning.
We present a hierarchical generative framework to synthesize long-term 3D human motion conditioned on the 3D scene structure.
Our experiments show significant improvements over previous approaches on generating natural and physically plausible human motion in a scene.
arXiv Detail & Related papers (2020-12-10T09:09:38Z)
- Long-term Human Motion Prediction with Scene Context [60.096118270451974]
We propose a novel three-stage framework for predicting human motion.
Our method first samples multiple human motion goals, then plans 3D human paths towards each goal, and finally predicts 3D human pose sequences following each path.
arXiv Detail & Related papers (2020-07-07T17:59:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.