Self-Supervised Learning of Action Affordances as Interaction Modes
- URL: http://arxiv.org/abs/2305.17565v1
- Date: Sat, 27 May 2023 19:58:11 GMT
- Title: Self-Supervised Learning of Action Affordances as Interaction Modes
- Authors: Liquan Wang, Nikita Dvornik, Rafael Dubeau, Mayank Mittal, Animesh Garg
- Abstract summary: In this work, we tackle unsupervised learning of priors of useful interactions with articulated objects.
We use no supervision or privileged information; we only assume access to the depth sensor in the simulator to learn the interaction modes.
We show that our model covers most of the human interaction modes, outperforms existing state-of-the-art methods for affordance learning, and can generalize to objects never seen during training.
- Score: 25.16302650076381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When humans perform a task with an articulated object, they interact with the
object only in a handful of ways, while the space of all possible interactions
is nearly endless. This is because humans have prior knowledge about what
interactions are likely to be successful, i.e., to open a new door we first try
the handle. While learning such priors without supervision is easy for humans,
it is notoriously hard for machines. In this work, we tackle unsupervised
learning of priors of useful interactions with articulated objects, which we
call interaction modes. In contrast to the prior art, we use no supervision or
privileged information; we only assume access to the depth sensor in the
simulator to learn the interaction modes. More precisely, we define a
successful interaction as one that changes the visual environment substantially,
and we learn a generative model of such interactions that can be conditioned on
the desired goal state of the object. In our experiments, we show that our
model covers most of the human interaction modes, outperforms existing
state-of-the-art methods for affordance learning, and can generalize to objects
never seen during training. Additionally, we show promising results in the
goal-conditional setup, where our model can be quickly fine-tuned to perform a
given task. Our experiments further confirm that the predicted affordances cover
most interaction modes of the queried articulated object and that the model can be
fine-tuned into a goal-conditional one. Supplementary material:
https://actaim.github.io.
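To make the success criterion above concrete (an interaction counts as successful if it changes the depth observation substantially), here is a minimal sketch; the change metric, the threshold `tau`, and all function names are illustrative assumptions, not the authors' implementation.
```python
import numpy as np

def visual_change_score(depth_before: np.ndarray, depth_after: np.ndarray) -> float:
    """Mean absolute per-pixel depth change (meters) between two renders."""
    valid = np.isfinite(depth_before) & np.isfinite(depth_after)
    return float(np.abs(depth_after[valid] - depth_before[valid]).mean())

def is_successful_interaction(depth_before, depth_after, tau: float = 0.02) -> bool:
    """Keep an interaction as a positive training sample if the scene changed enough."""
    return visual_change_score(depth_before, depth_after) > tau

# Toy usage: a cabinet door swinging ~10 cm closer over a quarter of the image
# passes the test; such interactions would be kept as training data for the
# (optionally goal-conditioned) generative model.
before = np.full((128, 128), 1.0)
after = before.copy()
after[32:96, 32:96] -= 0.10
print(is_successful_interaction(before, after))  # True
```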
Related papers
- Controlling the World by Sleight of Hand [26.874176292105556]
We learn an action-conditional generative model from unlabeled videos of human hands interacting with objects.
Given an image and the shape/location of a desired hand interaction, CosHand synthesizes an image of the future after the interaction has occurred.
Experiments show that the resulting model can predict the effects of hand-object interactions well.
arXiv Detail & Related papers (2024-08-13T18:33:45Z)
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI).
The experimental results demonstrate that MPI improves over the previous state of the art by 10% to 64% on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions [62.510951695174604]
"Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR) is a probabilistic generative framework that generates hypotheses about how objects articulate given input observations.
We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework.
We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
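As a rough illustration of the "Update" step in such a hypothesize-simulate-act-update loop, the sketch below reweights articulation hypotheses by how well their simulated predictions match the observed motion; the Gaussian observation model, `sigma`, and the toy hinge/slider numbers are assumptions rather than details from the paper.
```python
import numpy as np

def update_beliefs(beliefs, predicted_motions, observed_motion, sigma=0.05):
    """Bayesian reweighting of articulation hypotheses under a Gaussian observation model."""
    beliefs = np.asarray(beliefs, dtype=float)
    errors = np.array([np.linalg.norm(p - observed_motion) for p in predicted_motions])
    likelihoods = np.exp(-0.5 * (errors / sigma) ** 2)
    posterior = beliefs * likelihoods
    return posterior / posterior.sum()

# Two hypotheses (revolute hinge vs. prismatic slider) predict different handle
# displacements after the same push; the observation favors the hinge.
predicted = [np.array([0.00, 0.10, 0.02]),   # hinge prediction (m)
             np.array([0.10, 0.00, 0.00])]   # slider prediction (m)
observed = np.array([0.01, 0.09, 0.02])
print(update_beliefs([0.5, 0.5], predicted, observed))  # ~[0.96, 0.04]
```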
arXiv Detail & Related papers (2022-10-22T18:39:33Z)
- A Differentiable Recipe for Learning Visual Non-Prehensile Planar Manipulation [63.1610540170754]
We focus on the problem of visual non-prehensile planar manipulation.
We propose a novel architecture that combines video decoding neural models with priors from contact mechanics.
We find that our modular and fully differentiable architecture performs better than learning-only methods on unseen objects and motions.
arXiv Detail & Related papers (2021-11-09T18:39:45Z)
- Coarse-to-Fine Imitation Learning: Robot Manipulation from a Single Demonstration [8.57914821832517]
We introduce a simple new method for visual imitation learning, which allows a novel robot manipulation task to be learned from a single human demonstration.
Our method models imitation learning as a state estimation problem, with the state defined as the end-effector's pose.
At test time, the end-effector moves to the estimated state along a linear path, after which the end-effector velocities from the original demonstration are simply replayed.
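This test-time procedure admits a compact sketch: move to the estimated pose along a straight line, then replay the recorded velocities open-loop. Everything below (the controller interface `send_velocity`, the time step, the waypoint count) is an illustrative assumption, not the paper's implementation.
```python
import numpy as np

def linear_approach(start: np.ndarray, goal: np.ndarray, n_steps: int = 50) -> np.ndarray:
    """Straight-line waypoints from the current pose to the estimated bottleneck pose."""
    alphas = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - alphas) * start + alphas * goal

def run_test_time(start, estimated_pose, demo_velocities, send_velocity, dt=0.05):
    # Coarse phase: drive along a straight line to the pose estimated from the demo.
    waypoints = linear_approach(start, estimated_pose)
    for prev, nxt in zip(waypoints[:-1], waypoints[1:]):
        send_velocity((nxt - prev) / dt)
    # Fine phase: replay the demonstration's end-effector velocities open-loop.
    for v in demo_velocities:
        send_velocity(v)

# Toy usage with a dummy controller that just records the commanded velocities.
commands = []
run_test_time(np.zeros(3), np.array([0.3, 0.0, 0.2]),
              demo_velocities=[np.array([0.0, 0.0, -0.05])] * 10,
              send_velocity=commands.append)
```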
arXiv Detail & Related papers (2021-05-13T16:36:55Z)
- Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z)
- Learning Dexterous Grasping with Object-Centric Visual Affordances [86.49357517864937]
Dexterous robotic hands are appealing for their agility and human-like morphology.
We introduce an approach for learning dexterous grasping.
Our key idea is to embed an object-centric visual affordance model within a deep reinforcement learning loop.
arXiv Detail & Related papers (2020-09-03T04:00:40Z)
- Visual Prediction of Priors for Articulated Object Interaction [37.759459329701194]
Humans are able to build on prior experience quickly and efficiently.
Adults also exhibit this behavior when entering new spaces such as kitchens.
We develop a method, Contextual Prior Prediction, which provides a means of transferring knowledge between interactions in similar domains through vision.
arXiv Detail & Related papers (2020-06-06T21:17:03Z)
- Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)