Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild
- URL: http://arxiv.org/abs/2007.15649v2
- Date: Wed, 19 Aug 2020 20:17:49 GMT
- Title: Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild
- Authors: Jason Y. Zhang and Sam Pepose and Hanbyul Joo and Deva Ramanan and Jitendra Malik and Angjoo Kanazawa
- Abstract summary: We present a method that infers spatial arrangements and shapes of humans and objects in a globally consistent 3D scene.
Our method runs on datasets without any scene- or object-level 3D supervision.
- Score: 96.08358373137438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a method that infers spatial arrangements and shapes of humans and
objects in a globally consistent 3D scene, all from a single image in-the-wild
captured in an uncontrolled environment. Notably, our method runs on datasets
without any scene- or object-level 3D supervision. Our key insight is that
considering humans and objects jointly gives rise to "3D common sense"
constraints that can be used to resolve ambiguity. In particular, we introduce
a scale loss that learns the distribution of object size from data; an
occlusion-aware silhouette re-projection loss to optimize object pose; and a
human-object interaction loss to capture the spatial layout of objects with
which humans interact. We empirically validate that our constraints
dramatically reduce the space of likely 3D spatial configurations. We
demonstrate our approach on challenging, in-the-wild images of humans
interacting with large objects (such as bicycles, motorcycles, and surfboards)
and handheld objects (such as laptops, tennis rackets, and skateboards). We
quantify the ability of our approach to recover human-object arrangements and
outline remaining challenges in this relatively unexplored domain. The project webpage can
be found at https://jasonyzhang.com/phosa.
Related papers
- StackFLOW: Monocular Human-Object Reconstruction by Stacked Normalizing Flow with Offset [56.71580976007712]
We propose to use the Human-Object Offset between anchors which are densely sampled from the surface of human mesh and object mesh to represent human-object spatial relation.
Based on this representation, we propose Stacked Normalizing Flow (StackFLOW) to infer the posterior distribution of human-object spatial relations from the image.
During the optimization stage, we finetune the human body pose and object 6D pose by maximizing the likelihood of samples.
arXiv Detail & Related papers (2024-07-30T04:57:21Z)
- Human-Aware 3D Scene Generation with Spatially-constrained Diffusion Models [16.259040755335885]
Previous auto-regression-based 3D scene generation methods have struggled to accurately capture the joint distribution of multiple objects and input humans.
We introduce two spatial collision guidance mechanisms: human-object collision avoidance and object-room boundary constraints.
Our framework can generate more natural and plausible 3D scenes with precise human-scene interactions.
arXiv Detail & Related papers (2024-06-26T08:18:39Z)
- CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images [10.4286198282079]
We present a method for teaching machines to understand and model the underlying spatial common sense of diverse human-object interactions in 3D.
We leverage multiple 2D images, synthesized from different viewpoints, of humans interacting with the same type of object.
Despite their imperfect quality compared to real images, we demonstrate that the synthesized images are sufficient to learn 3D human-object spatial relations.
arXiv Detail & Related papers (2023-08-23T17:59:11Z)
- FLEX: Full-Body Grasping Without Full-Body Grasps [24.10724524386518]
We address the task of generating a virtual human -- hands and full body -- grasping everyday objects.
Existing methods approach this problem by collecting a 3D dataset of humans interacting with objects and training on this data.
We leverage the existence of both full-body pose and hand grasping priors, composing them using 3D geometrical constraints to obtain full-body grasps.
arXiv Detail & Related papers (2022-11-21T23:12:54Z)
- Human-Aware Object Placement for Visual Environment Reconstruction [63.14733166375534]
We show that human-scene interactions can be leveraged to improve the 3D reconstruction of a scene from a monocular RGB video.
Our key idea is that, as a person moves through a scene and interacts with it, we accumulate human-scene interactions (HSIs) across multiple input images.
We show that our scene reconstruction can be used to refine the initial 3D human pose and shape estimation.
arXiv Detail & Related papers (2022-03-07T18:59:02Z)
- D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z)
- Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time [22.574069344246052]
We propose a unified framework for estimating the 3D hand and object poses with semi-supervised learning.
We build a joint learning framework where we perform explicit contextual reasoning between hand and object representations by a Transformer.
Our method not only improves hand pose estimation on a challenging real-world dataset, but also substantially improves object pose estimation, which has fewer ground-truth annotations per instance.
arXiv Detail & Related papers (2021-06-09T17:59:34Z)
- Reconstructing Hand-Object Interactions in the Wild [71.16013096764046]
We propose an optimization-based procedure which does not require direct 3D supervision.
We exploit all available related data (2D bounding boxes, 2D hand keypoints, 2D instance masks, 3D object models, 3D in-the-lab MoCap) to provide constraints for the 3D reconstruction.
Our method produces compelling reconstructions on the challenging in-the-wild data from the EPIC Kitchens and the 100 Days of Hands datasets.
arXiv Detail & Related papers (2020-12-17T18:59:58Z)
- Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations [73.11883464562895]
We propose a new architecture that facilitates unsupervised, or lightly supervised, learning.
We demonstrate the method by learning 3D human pose and shape from un-paired and un-annotated images.
While we present results for modeling humans, our formulation is general and can be applied to other vision problems.
arXiv Detail & Related papers (2020-01-06T14:54:00Z)