Understanding 3D Object Interaction from a Single Image
- URL: http://arxiv.org/abs/2305.09664v2
- Date: Fri, 4 Aug 2023 20:29:58 GMT
- Title: Understanding 3D Object Interaction from a Single Image
- Authors: Shengyi Qian, David F. Fouhey
- Abstract summary: Humans can easily understand a single image as depicting multiple potential objects that permit interaction.
We would like to endow machines with a similar ability, so that intelligent agents can better explore 3D scenes or manipulate objects.
- Score: 18.681222155879656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans can easily understand a single image as depicting multiple potential objects that permit interaction. We use this skill to plan our interactions with the world and to accelerate our understanding of new objects without engaging in interaction. In this paper, we would like to endow machines with a similar ability, so that intelligent agents can better explore 3D scenes or manipulate objects. Our approach is a transformer-based model that predicts the 3D location, physical properties, and affordance of objects. To power this model, we collect a dataset comprising Internet videos, egocentric videos, and indoor images to train and validate our approach. Our model yields strong performance on our data and generalizes well to robotics data. Project site: https://jasonqsy.github.io/3DOI/
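The abstract describes the architecture only at a high level: a single transformer that takes an image and predicts, per object, a 3D location, physical properties, and an affordance. The sketch below illustrates one plausible query-based design for this kind of multi-head prediction in PyTorch; the patch embedding, learned object queries, layer sizes, and head shapes are all assumptions made for illustration, not the released 3DOI model.

```python
# Illustrative sketch only: a query-based transformer with one head per
# output named in the abstract. All sizes and design choices here are
# assumptions, not the released 3DOI model.
import torch
import torch.nn as nn

class InteractionTransformer(nn.Module):
    def __init__(self, d_model=256, num_queries=16, num_props=4):
        super().__init__()
        # Stand-in backbone: linearly embed flattened 16x16 RGB patches.
        self.patch_embed = nn.Linear(3 * 16 * 16, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # Learned object queries, one per candidate interactable object.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        # One prediction head per output.
        self.loc_head = nn.Linear(d_model, 3)           # 3D location (x, y, z)
        self.prop_head = nn.Linear(d_model, num_props)  # physical property logits
        self.aff_head = nn.Linear(d_model, 32 * 32)     # coarse affordance map

    def forward(self, patches):  # patches: (B, num_patches, 3*16*16)
        memory = self.encoder(self.patch_embed(patches))
        queries = self.queries.unsqueeze(0).expand(patches.size(0), -1, -1)
        h = self.decoder(queries, memory)  # (B, num_queries, d_model)
        return {
            "location": self.loc_head(h),
            "properties": self.prop_head(h),
            "affordance": self.aff_head(h).view(*h.shape[:2], 32, 32),
        }
```

In a query-based design like this, every output for an object is decoded from the same per-query embedding, which is what lets one model predict location, physical properties, and affordance jointly.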
Related papers
- ROAM: Robust and Object-Aware Motion Generation Using Neural Pose Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object (a minimal sketch of this idea follows this entry).
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z)
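For readers unfamiliar with descriptor fields: a descriptor field maps any 3D query point to a feature vector describing that point's relation to an object, and SE(3) equivariance means the descriptors move consistently with the object under rigid transforms. The sketch below shows one minimal way to obtain such invariance, by canonicalizing query points into the object's local frame before a small MLP; the known object pose (R, t) and all sizes are assumptions, and ROAM's actual field construction may differ.

```python
# Hypothetical sketch: descriptors computed from object-frame coordinates
# are unchanged when object and query points move rigidly together.
# This illustrates the general idea only, not ROAM's actual field.
import torch
import torch.nn as nn

class DescriptorField(nn.Module):
    def __init__(self, d_out=64):
        super().__init__()
        # Simple MLP over object-frame coordinates.
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, d_out),
        )

    def forward(self, query_pts, obj_R, obj_t):
        # query_pts: (N, 3) world-frame points; (obj_R, obj_t) is the
        # hypothetical known object pose. Canonicalize: x = R^T (p - t).
        local = (query_pts - obj_t) @ obj_R
        return self.mlp(local)  # (N, d_out) descriptors

# Invariance check: move object pose and queries by the same rigid motion.
field = DescriptorField()
pts = torch.randn(10, 3)
R = torch.linalg.qr(torch.randn(3, 3))[0]  # random orthogonal matrix
t = torch.randn(3)
d1 = field(pts, torch.eye(3), torch.zeros(3))
d2 = field(pts @ R.T + t, R, t)
assert torch.allclose(d1, d2, atol=1e-5)
```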
- Perceiving Unseen 3D Objects by Poking the Objects [45.70559270947074]
We propose a poking-based approach that automatically discovers and reconstructs 3D objects.
The poking process not only enables the robot to discover unseen 3D objects but also produces multi-view observations.
Experiments on real-world data show that our approach can discover and reconstruct unseen 3D objects with high quality, without supervision.
arXiv Detail & Related papers (2023-02-26T18:22:13Z)
- PIZZA: A Powerful Image-only Zero-Shot Zero-CAD Approach to 6 DoF Tracking [27.283648727847268]
We present a method for tracking the 6D motion of objects in RGB video sequences when neither the training images nor the 3D geometry of the objects is available.
In contrast to previous works, our method can therefore handle unknown objects in the open world instantly.
Our results on challenging datasets are on par with previous works that require much more information.
arXiv Detail & Related papers (2022-09-15T19:55:13Z)
- BEHAVE: Dataset and Method for Tracking Human Object Interactions [105.77368488612704]
We present the first full-body human-object interaction dataset with multi-view RGBD frames and corresponding 3D SMPL and object fits, along with annotated contacts between them.
We use this data to learn a model that can jointly track humans and objects in natural environments with an easy-to-use portable multi-camera setup.
arXiv Detail & Related papers (2022-04-14T13:21:19Z)
- Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video.
Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z)
- D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks (a minimal sketch of this idea follows this entry).
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
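The pattern described here, planning through a dynamics model learned over a latent representation, can be made concrete with a short sketch: encode the current and goal observations, roll candidate action sequences through the learned transition function, and execute the sequence whose predicted final latent is closest to the goal. Everything below (the random-shooting planner, names, and shapes) is a hypothetical illustration, not the paper's implementation.

```python
# Hypothetical sketch of visuomotor control via a learned latent dynamics
# model with random-shooting planning. Names and shapes are assumptions.
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    def __init__(self, obs_dim=512, z_dim=64, act_dim=4):
        super().__init__()
        # Encoder: observation features -> compact latent state.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, z_dim))
        # Transition model: (latent state, action) -> next latent state.
        self.dynamics = nn.Sequential(nn.Linear(z_dim + act_dim, 256), nn.ReLU(),
                                      nn.Linear(256, z_dim))

    def rollout(self, z, actions):  # actions: (K, T, act_dim)
        z = z.expand(actions.size(0), -1)  # one copy per candidate sequence
        for t in range(actions.size(1)):
            z = self.dynamics(torch.cat([z, actions[:, t]], dim=-1))
        return z  # predicted final latents, (K, z_dim)

def plan(model, obs, goal_obs, horizon=10, num_candidates=256, act_dim=4):
    """Random shooting: sample action sequences, keep the best one."""
    z0 = model.encoder(obs.unsqueeze(0))
    zg = model.encoder(goal_obs.unsqueeze(0))
    candidates = torch.randn(num_candidates, horizon, act_dim)
    with torch.no_grad():
        z_final = model.rollout(z0, candidates)
        cost = (z_final - zg).pow(2).sum(dim=-1)  # distance to goal latent
    return candidates[cost.argmin()]  # best action sequence, (horizon, act_dim)
```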
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
- Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction [24.72947291987545]
A key challenge for an agent learning to interact with the world is reasoning about the physical properties of objects.
We propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images.
arXiv Detail & Related papers (2020-08-02T11:04:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.