Reconstructing Action-Conditioned Human-Object Interactions Using
Commonsense Knowledge Priors
- URL: http://arxiv.org/abs/2209.02485v1
- Date: Tue, 6 Sep 2022 13:32:55 GMT
- Title: Reconstructing Action-Conditioned Human-Object Interactions Using
Commonsense Knowledge Priors
- Authors: Xi Wang, Gen Li, Yen-Ling Kuo, Muhammed Kocabas, Emre Aksan, Otmar
Hilliges
- Abstract summary: We present a method for inferring diverse 3D models of human-object interactions from images.
Our method extracts high-level commonsense knowledge from large language models.
We quantitatively evaluate the inferred 3D models on a large human-object interaction dataset.
- Score: 42.17542596399014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a method for inferring diverse 3D models of human-object
interactions from images. Reasoning about how humans interact with objects in
complex scenes from a single 2D image is a challenging task given ambiguities
arising from the loss of information through projection. In addition, modeling
3D interactions requires the ability to generalize across diverse object
categories and interaction types. We propose an action-conditioned modeling of
interactions that allows us to infer diverse 3D arrangements of humans and
objects without supervision on contact regions or 3D scene geometry. Our method
extracts high-level commonsense knowledge from large language models (such as
GPT-3) and applies it to 3D reasoning about human-object interactions.
Our key insight is that priors extracted from large language models can help in
reasoning about human-object contacts from textual prompts alone. We
quantitatively evaluate the inferred 3D models on a large human-object
interaction dataset and show how our method leads to better 3D reconstructions.
We further qualitatively evaluate the effectiveness of our method on real
images and demonstrate its generalizability towards interaction types and
object categories.
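To make the core idea concrete, below is a minimal sketch of how commonsense contact priors might be queried from a GPT-3-style model. Everything here is an illustrative assumption rather than the paper's actual implementation: the prompt wording, the body-part vocabulary, and the `contact_prior` helper are hypothetical, and the legacy OpenAI completions API stands in for whatever interface the authors used.

```python
# Hypothetical sketch: querying GPT-3 for commonsense contact priors.
# Prompt wording, parsing, and the body-part vocabulary are illustrative
# assumptions, not the paper's actual prompts.
import openai  # legacy (<1.0) completions API; set openai.api_key first

BODY_PARTS = ["left hand", "right hand", "hips", "back", "left foot", "right foot"]

def contact_prior(action: str, obj: str) -> list[str]:
    """Ask the LLM which body parts likely touch `obj` during `action`."""
    prompt = (
        f"A person is {action} a {obj}. Which of these body parts are in "
        f"contact with the {obj}? Choose from: {', '.join(BODY_PARTS)}. "
        f"Answer with a comma-separated list."
    )
    resp = openai.Completion.create(
        model="text-davinci-002",  # a GPT-3-era completion model
        prompt=prompt,
        max_tokens=32,
        temperature=0.0,           # deterministic: we want the modal answer
    )
    text = resp.choices[0].text.lower()
    return [p for p in BODY_PARTS if p in text]

# e.g. contact_prior("sitting on", "chair") might return ["hips", "back"]
```

A prior obtained this way could then enter a 3D fitting objective as a soft constraint. The sketch below is equally hedged (the part-to-vertex mapping and the unweighted sum are assumptions, not the paper's actual objective); it pulls each suggested body part toward the object surface:

```python
# Hypothetical contact term for joint human-object fitting.
import torch

def contact_loss(body_verts, object_verts, part_vertex_ids):
    """body_verts: (V, 3) posed human mesh vertices.
    object_verts: (M, 3) object mesh vertices.
    part_vertex_ids: {part name -> LongTensor of vertex indices into
    body_verts}, covering only the parts the LLM flagged as in contact."""
    loss = body_verts.new_zeros(())
    for ids in part_vertex_ids.values():
        part = body_verts[ids]                    # (P, 3) vertices of one part
        d = torch.cdist(part, object_verts)       # (P, M) pairwise distances
        loss = loss + d.min(dim=1).values.mean()  # mean distance to nearest object point
    return loss
```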
Related papers
- Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer [58.98785899556135]
We present a novel joint 3D human-object reconstruction method (CONTHO) that effectively exploits contact information between humans and objects.
There are two core designs in our system: 1) 3D-guided contact estimation and 2) contact-based 3D human and object refinement.
arXiv Detail & Related papers (2024-04-07T06:01:49Z)
- Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models [8.933560282929726]
We introduce a novel affordance representation, named Comprehensive Affordance (ComA).
Given a 3D object mesh, ComA models the distribution of relative orientation and proximity of vertices in interacting human meshes.
We demonstrate that ComA outperforms competitors that rely on human annotations in modeling contact-based affordance.
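As a loose illustration of the kind of statistic such a representation aggregates, the sketch below histograms each human-mesh vertex's distance to the object over many interaction samples. The `proximity_distribution` helper, the binning, and the array shapes are assumptions of this sketch, not ComA's actual formulation.

```python
# Illustrative per-vertex proximity statistic, in the spirit of ComA.
import numpy as np
from scipy.spatial import cKDTree

def proximity_distribution(human_verts_batch, object_verts, bins=32, r_max=0.5):
    """human_verts_batch: (N, V, 3) human vertices over N interaction samples.
    object_verts: (M, 3) vertices of one object kept in a canonical frame.
    Returns (V, bins): per-vertex histogram of distances to the object."""
    tree = cKDTree(object_verts)
    n_samples, n_verts, _ = human_verts_batch.shape
    edges = np.linspace(0.0, r_max, bins + 1)
    hist = np.zeros((n_verts, bins))
    for sample in human_verts_batch:
        dist, _ = tree.query(sample)  # (V,) distance to nearest object vertex
        # Distances beyond r_max are clipped into the last bin.
        idx = np.clip(np.digitize(dist, edges) - 1, 0, bins - 1)
        hist[np.arange(n_verts), idx] += 1
    return hist / n_samples           # empirical distribution per vertex
```

Vertices whose mass concentrates in the lowest bins are the likely contact regions; ComA additionally models relative orientation, which a full implementation would bin alongside distance.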
arXiv Detail & Related papers (2024-01-23T18:59:59Z)
- Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation [38.08445005326031]
We propose ProciGen to procedurally generate datasets with both plausible interaction and diverse object variation.
We generate 1M+ human-object interaction pairs in 3D and leverage this large-scale data to train our HDM (Hierarchical Diffusion Model).
Our HDM is an image-conditioned diffusion model that learns both realistic interaction and highly accurate human and object shapes.
arXiv Detail & Related papers (2023-12-12T08:32:55Z)
- Controllable Human-Object Interaction Synthesis [77.56877961681462]
We propose Controllable Human-Object Interaction Synthesis (CHOIS) to generate synchronized object motion and human motion in 3D scenes.
Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene.
Our module seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
arXiv Detail & Related papers (2023-12-06T21:14:20Z)
- CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images [10.4286198282079]
We present a method for teaching machines to understand and model the underlying spatial common sense of diverse human-object interactions in 3D.
We synthesize multiple 2D images, captured from different viewpoints, of humans interacting with the same type of object.
Although the synthesized images are imperfect in quality compared to real images, we demonstrate that they are sufficient to learn 3D human-object spatial relations.
arXiv Detail & Related papers (2023-08-23T17:59:11Z)
- Grounding 3D Object Affordance from 2D Interactions in Images [128.6316708679246]
Grounding 3D object affordance seeks to locate the "action possibility" regions of objects in 3D space.
Humans possess the ability to perceive object affordances in the physical world through demonstration images or videos.
We devise an Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region features of objects from different sources.
arXiv Detail & Related papers (2023-03-18T15:37:35Z)
- Full-Body Articulated Human-Object Interaction [61.01135739641217]
CHAIRS is a large-scale motion-captured f-AHOI dataset consisting of 16.2 hours of versatile interactions.
CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process.
By learning the geometrical relationships in HOI, we devise the first model that leverages human pose estimation to tackle the estimation of articulated object poses and shapes during whole-body interactions.
arXiv Detail & Related papers (2022-12-20T19:50:54Z)
- Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video.
Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z)
- VAT-Mart: Learning Visual Action Trajectory Proposals for Manipulating 3D ARTiculated Objects [19.296344218177534]
The space of 3D articulated objects is exceptionally rich in its myriad semantic categories, diverse shape geometries, and complicated part functionality.
Previous works mostly abstract the kinematic structure with estimated joint parameters and part poses as the visual representation for manipulating 3D articulated objects.
We propose object-centric actionable visual priors as a novel perception-interaction handshaking point, such that the perception system outputs more actionable guidance than kinematic structure estimation.
arXiv Detail & Related papers (2021-06-28T07:47:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.