Capturing and Inferring Dense Full-Body Human-Scene Contact
- URL: http://arxiv.org/abs/2206.09553v1
- Date: Mon, 20 Jun 2022 03:31:00 GMT
- Title: Capturing and Inferring Dense Full-Body Human-Scene Contact
- Authors: Chun-Hao P. Huang, Hongwei Yi, Markus Höschle, Matvey Safroshkin,
Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, Michael J. Black
- Abstract summary: We train a network that predicts dense body-scene contacts from a single RGB image.
We use a transformer to learn such non-local relationships and propose a new Body-Scene contact TRansfOrmer (BSTRO)
To our knowledge, BSTRO is the first method to directly estimate 3D body-scene contact from a single image.
- Score: 40.29636308110822
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inferring human-scene contact (HSC) is the first step toward understanding
how humans interact with their surroundings. While detecting 2D human-object
interaction (HOI) and reconstructing 3D human pose and shape (HPS) have enjoyed
significant progress, reasoning about 3D human-scene contact from a single
image is still challenging. Existing HSC detection methods consider only a few
types of predefined contact, often reduce body and scene to a small number of
primitives, and even overlook image evidence. To predict human-scene contact
from a single image, we address the limitations above from both data and
algorithmic perspectives. We capture a new dataset called RICH for "Real
scenes, Interaction, Contact and Humans." RICH contains multiview
outdoor/indoor video sequences at 4K resolution, ground-truth 3D human bodies
captured using markerless motion capture, 3D body scans, and high resolution 3D
scene scans. A key feature of RICH is that it also contains accurate
vertex-level contact labels on the body. Using RICH, we train a network that
predicts dense body-scene contacts from a single RGB image. Our key insight is
that regions in contact are always occluded so the network needs the ability to
explore the whole image for evidence. We use a transformer to learn such
non-local relationships and propose a new Body-Scene contact TRansfOrmer
(BSTRO). Very few methods explore 3D contact; those that do focus on the feet
only, detect foot contact as a post-processing step, or infer contact from body
pose without looking at the scene. To our knowledge, BSTRO is the first method
to directly estimate 3D body-scene contact from a single image. We demonstrate
that BSTRO significantly outperforms the prior art. The code and dataset are
available at https://rich.is.tue.mpg.de.
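Since RICH provides vertex-level contact labels, dense contact prediction of the kind BSTRO performs can be viewed as per-vertex binary classification over the body mesh, and detection metrics such as precision, recall, and F1 over vertices are a natural way to score it. The sketch below is illustrative only and is not code from the paper; the 0.5 decision threshold and the toy vertex count are assumptions.

```python
# Illustrative sketch (not the paper's code): scoring dense per-vertex
# contact predictions against ground-truth vertex-level contact labels.
# The 0.5 threshold and the toy vertex count are assumptions.

def contact_metrics(pred_probs, gt_labels, threshold=0.5):
    """Precision, recall, and F1 for binary per-vertex contact."""
    preds = [p >= threshold for p in pred_probs]
    tp = sum(p and g for p, g in zip(preds, gt_labels))          # true positives
    fp = sum(p and not g for p, g in zip(preds, gt_labels))      # false positives
    fn = sum((not p) and g for p, g in zip(preds, gt_labels))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: 6 "vertices" standing in for a full-resolution body mesh.
probs = [0.9, 0.8, 0.2, 0.7, 0.1, 0.4]
labels = [True, True, False, False, False, True]
p, r, f1 = contact_metrics(probs, labels)  # each evaluates to 2/3 here
```

Because contact vertices are typically a small fraction of the mesh, plain accuracy would be misleadingly high for an all-negative predictor, which is why precision/recall-style metrics are the sensible choice.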
Related papers
- 3D Reconstruction of Interacting Multi-Person in Clothing from a Single Image [8.900009931200955]
This paper introduces a novel pipeline to reconstruct the geometry of multiple interacting people in clothing, within a globally coherent scene space, from a single image.
We overcome this challenge by utilizing two human priors for complete 3D geometry and surface contacts.
The results demonstrate that our method is complete, globally coherent, and physically plausible compared to existing methods.
arXiv Detail & Related papers (2024-01-12T07:23:02Z)
- DECO: Dense Estimation of 3D Human-Scene Contact In The Wild [54.44345845842109]
We train a novel 3D contact detector that uses both body-part-driven and scene-context-driven attention to estimate contact on the SMPL body.
We significantly outperform existing SOTA methods across all benchmarks.
We also show qualitatively that DECO generalizes well to diverse and challenging real-world human interactions in natural images.
arXiv Detail & Related papers (2023-09-26T21:21:07Z)
- Detecting Human-Object Contact in Images [75.35017308643471]
Humans constantly contact objects to move and perform tasks.
Yet no robust method exists to detect contact between the body and the scene from an image.
We build a new dataset of human-object contacts for images.
arXiv Detail & Related papers (2023-03-06T18:56:26Z)
- Human-Aware Object Placement for Visual Environment Reconstruction [63.14733166375534]
We show that human-scene interactions can be leveraged to improve the 3D reconstruction of a scene from a monocular RGB video.
Our key idea is that, as a person moves through a scene and interacts with it, we accumulate HSIs across multiple input images.
We show that our scene reconstruction can be used to refine the initial 3D human pose and shape estimation.
arXiv Detail & Related papers (2022-03-07T18:59:02Z)
- Learning Motion Priors for 4D Human Body Capture in 3D Scenes [81.54377747405812]
We propose LEMO: LEarning human MOtion priors for 4D human body capture.
We introduce a novel motion prior, which reduces the jitters exhibited by poses recovered over a sequence.
We also design a contact friction term and a contact-aware motion infiller obtained via per-instance self-supervised training.
With our pipeline, we demonstrate high-quality 4D human body capture, reconstructing smooth motions and physically plausible body-scene interactions.
arXiv Detail & Related papers (2021-08-23T20:47:09Z)
- Populating 3D Scenes by Learning Human-Scene Interaction [47.42049393299]
We learn how humans interact with scenes and leverage this to enable virtual characters to do the same.
The representation of interaction is body-centric, which enables it to generalize to new scenes.
We show that POSA's learned representation of body-scene interaction supports monocular human pose estimation.
arXiv Detail & Related papers (2020-12-21T18:57:55Z)
- PLACE: Proximity Learning of Articulation and Contact in 3D Environments [70.50782687884839]
We propose a novel interaction generation method, named PLACE, which explicitly models the proximity between the human body and the 3D scene around it.
Our perceptual study shows that PLACE significantly improves the state-of-the-art method, approaching the realism of real human-scene interaction.
arXiv Detail & Related papers (2020-08-12T21:00:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.