CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from
Unbounded Synthesized Images
- URL: http://arxiv.org/abs/2308.12288v2
- Date: Sun, 3 Sep 2023 23:00:47 GMT
- Title: CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from
Unbounded Synthesized Images
- Authors: Sookwan Han and Hanbyul Joo
- Abstract summary: We present a method for teaching machines to understand and model the underlying spatial common sense of diverse human-object interactions in 3D.
The key idea is to learn from multiple 2D images captured from different viewpoints of humans interacting with the same type of object.
Despite the imperfect image quality compared to real images, we demonstrate that synthesized images are sufficient to learn 3D human-object spatial relations.
- Score: 10.4286198282079
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a method for teaching machines to understand and model the
underlying spatial common sense of diverse human-object interactions in 3D in a
self-supervised way. This is a challenging task, as there exist specific
manifolds of the interactions that can be considered human-like and natural,
but the human pose and the geometry of objects can vary even for similar
interactions. Such diversity makes annotating 3D interactions difficult and hard
to scale, which limits the potential to reason about them in a supervised way.
One way of learning the 3D spatial relationship between humans and objects
during interaction is to observe multiple 2D images captured from different
viewpoints of humans interacting with the same type of object.
The core idea of our method is to leverage a generative model that produces
high-quality 2D images from an arbitrary text prompt input as an "unbounded"
data generator with effective controllability and view diversity. Despite the
imperfect image quality compared to real images, we demonstrate that the
synthesized images are sufficient to learn the 3D human-object spatial
relations. We present multiple strategies to leverage the synthesized images,
including (1) the first method to leverage a generative image model for 3D
human-object spatial relation learning; (2) a framework to reason about the 3D
spatial relations from inconsistent 2D cues in a self-supervised manner via 3D
occupancy reasoning with pose canonicalization; (3) semantic clustering to
disambiguate different types of interactions with the same object types; and
(4) a novel metric to assess the quality of 3D spatial learning of interaction.
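As a rough illustration of the occupancy-reasoning step described above, the sketch below aggregates per-view 2D object masks into a voxel grid defined in a pose-canonicalized, human-centric frame. This is not the authors' implementation: the function names, array shapes, and the simple projection-voting scheme are illustrative assumptions, and the sketch presumes that per-view object masks and camera poses in the canonical frame are already available.

```python
# Minimal sketch (assumed interfaces, not CHORUS's actual code): accumulate
# multi-view 2D object evidence into a canonical 3D occupancy grid.
import numpy as np

def canonical_occupancy(masks, intrinsics, cam_to_canonical,
                        grid_res=64, extent=2.0):
    """Aggregate 2D object masks from many views into a voxel grid.

    masks:            list of (H, W) boolean object masks, one per view
    intrinsics:       list of (3, 3) camera intrinsic matrices
    cam_to_canonical: list of (4, 4) camera-to-canonical-frame transforms
    """
    # Voxel centers of a cube of side `extent` around the canonical body.
    lin = np.linspace(-extent / 2, extent / 2, grid_res)
    xs, ys, zs = np.meshgrid(lin, lin, lin, indexing="ij")
    voxels = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)      # (V, 3)

    votes = np.zeros(len(voxels))
    for mask, K, T in zip(masks, intrinsics, cam_to_canonical):
        # Transform voxel centers from the canonical frame into this camera.
        T_inv = np.linalg.inv(T)
        pts_cam = (T_inv[:3, :3] @ voxels.T + T_inv[:3, 3:4]).T  # (V, 3)
        in_front = pts_cam[:, 2] > 1e-6

        # Project to pixels; vote for voxels whose projection hits the mask.
        uv = (K @ pts_cam.T).T
        uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        h, w = mask.shape
        valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        votes[valid] += mask[v[valid], u[valid]]

    # Normalize vote counts into a soft occupancy in [0, 1].
    return (votes / max(len(masks), 1)).reshape(grid_res, grid_res, grid_res)
```

In the setting described in the abstract, the synthesized views are mutually inconsistent and interactions with the same object type can differ semantically, so the actual method must handle noisy cues and disambiguate interaction types via semantic clustering; this sketch ignores both and simply averages projection votes per voxel.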
Related papers
- Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding [46.05283810364663]
We introduce the Multi-Image Guided Invariant-Feature-Aware 3D Affordance Grounding framework.
It grounds 3D object affordance regions by identifying common interaction patterns across multiple human-object interaction images.
arXiv Detail & Related papers (2024-08-23T12:27:33Z)
- Monocular Human-Object Reconstruction in the Wild [11.261465071559163]
We present a 2D-supervised method that learns the 3D human-object spatial relation prior purely from 2D images in the wild.
Our method utilizes a flow-based neural network to learn the prior distribution of the 2D human-object keypoint layout and viewports for each image in the dataset.
arXiv Detail & Related papers (2024-07-30T05:45:06Z)
- AG3D: Learning to Generate 3D Avatars from 2D Image Collections [96.28021214088746]
We propose a new adversarial generative model of realistic 3D people from 2D images.
Our method captures shape and deformation of the body and loose clothing by adopting a holistic 3D generator.
We experimentally find that our method outperforms previous 3D- and articulation-aware methods in terms of geometry and appearance.
arXiv Detail & Related papers (2023-05-03T17:56:24Z)
- Grounding 3D Object Affordance from 2D Interactions in Images [128.6316708679246]
Grounding 3D object affordance seeks to locate objects' "action possibilities" regions in the 3D space.
Humans possess the ability to perceive object affordances in the physical world through demonstration images or videos.
We devise an Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region feature of objects from different sources.
arXiv Detail & Related papers (2023-03-18T15:37:35Z)
- Reconstructing Action-Conditioned Human-Object Interactions Using Commonsense Knowledge Priors [42.17542596399014]
We present a method for inferring diverse 3D models of human-object interactions from images.
Our method extracts high-level commonsense knowledge from large language models.
We quantitatively evaluate the inferred 3D models on a large human-object interaction dataset.
arXiv Detail & Related papers (2022-09-06T13:32:55Z)
- Neural Novel Actor: Learning a Generalized Animatable Neural Representation for Human Actors [98.24047528960406]
We propose a new method for learning a generalized animatable neural representation from a sparse set of multi-view imagery of multiple persons.
The learned representation can be used to synthesize novel view images of an arbitrary person from a sparse set of cameras, and further animate them with the user's pose control.
arXiv Detail & Related papers (2022-08-25T07:36:46Z)
- Grasping Field: Learning Implicit Representations for Human Grasps [16.841780141055505]
We propose an expressive representation for human grasp modelling that is efficient and easy to integrate with deep neural networks.
We name this 3D-to-2D mapping the Grasping Field, parameterize it with a deep neural network, and learn it from data.
Our generative model is able to synthesize high-quality human grasps, given only a 3D object point cloud.
arXiv Detail & Related papers (2020-08-10T23:08:26Z)
- Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild [96.08358373137438]
We present a method that infers spatial arrangements and shapes of humans and objects in a globally consistent 3D scene.
Our method runs on datasets without any scene- or object-level 3D supervision.
arXiv Detail & Related papers (2020-07-30T17:59:50Z)
- Detailed 2D-3D Joint Representation for Human-Object Interaction [45.71407935014447]
We propose a detailed 2D-3D joint representation learning method for HOI learning.
First, we utilize the single-view human body capture method to obtain detailed 3D body, face and hand shapes.
Next, we estimate the 3D object location and size with reference to the 2D human-object spatial configuration and object category priors.
arXiv Detail & Related papers (2020-04-17T10:22:12Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
- Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations [73.11883464562895]
We propose a new architecture that facilitates unsupervised, or lightly supervised, learning.
We demonstrate the method by learning 3D human pose and shape from un-paired and un-annotated images.
While we present results for modeling humans, our formulation is general and can be applied to other vision problems.
arXiv Detail & Related papers (2020-01-06T14:54:00Z)