Self-Supervised Learning of Action Affordances as Interaction Modes
- URL: http://arxiv.org/abs/2305.17565v1
- Date: Sat, 27 May 2023 19:58:11 GMT
- Title: Self-Supervised Learning of Action Affordances as Interaction Modes
- Authors: Liquan Wang, Nikita Dvornik, Rafael Dubeau, Mayank Mittal, Animesh Garg
- Abstract summary: In this work, we tackle unsupervised learning of priors of useful interactions with articulated objects.
We use no supervision or privileged information; we only assume access to the depth sensor in the simulator to learn the interaction modes.
We show that our model covers most of the human interaction modes, outperforms existing state-of-the-art methods for affordance learning, and can generalize to objects never seen during training.
- Score: 25.16302650076381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When humans perform a task with an articulated object, they interact with the
object only in a handful of ways, while the space of all possible interactions
is nearly endless. This is because humans have prior knowledge about what
interactions are likely to be successful, i.e., to open a new door we first try
the handle. While learning such priors without supervision is easy for humans,
it is notoriously hard for machines. In this work, we tackle unsupervised
learning of priors of useful interactions with articulated objects, which we
call interaction modes. In contrast to the prior art, we use no supervision or
privileged information; we only assume access to the depth sensor in the
simulator to learn the interaction modes. More precisely, we define a
successful interaction as one that changes the visual environment substantially,
and we learn a generative model of such interactions that can be conditioned on
the desired goal state of the object. In our experiments, we show that our
model covers most of the human interaction modes, outperforms existing
state-of-the-art methods for affordance learning, and can generalize to objects
never seen during training. Additionally, we show promising results in the
goal-conditional setup, where our model can be quickly fine-tuned to perform a
given task. Our experiments further confirm that the predicted affordances cover
most interaction modes of the queried articulated object and that the model can be
fine-tuned into a goal-conditional one. Supplementary material:
https://actaim.github.io.
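To make the success criterion above concrete (an interaction counts as successful if it changes the depth observation substantially), here is a minimal sketch; the change metric, the threshold `tau`, and all function names are illustrative assumptions, not the authors' implementation.
```python
import numpy as np

def visual_change_score(depth_before: np.ndarray, depth_after: np.ndarray) -> float:
    """Mean absolute per-pixel depth change (meters) between two renders."""
    valid = np.isfinite(depth_before) & np.isfinite(depth_after)
    return float(np.abs(depth_after[valid] - depth_before[valid]).mean())

def is_successful_interaction(depth_before, depth_after, tau: float = 0.02) -> bool:
    """Keep an interaction as a positive training sample if the scene changed enough."""
    return visual_change_score(depth_before, depth_after) > tau

# Toy usage: a cabinet door swinging ~10 cm closer over a quarter of the image
# passes the test; such interactions would be kept as training data for the
# (optionally goal-conditioned) generative model.
before = np.full((128, 128), 1.0)
after = before.copy()
after[32:96, 32:96] -= 0.10
print(is_successful_interaction(before, after))  # True
```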
Related papers
- Controlling the World by Sleight of Hand [26.874176292105556]
We learn an action-conditional generative model from unlabeled videos of human hands interacting with objects.
Given an image and the shape/location of a desired hand interaction, CosHand synthesizes an image of the future after the interaction has occurred.
Experiments show that the resulting model can predict the effects of hand-object interactions well.
arXiv Detail & Related papers (2024-08-13T18:33:45Z)
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI).
The experimental results demonstrate that MPI improves over the previous state of the art by 10% to 64% on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions [62.510951695174604]
"Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR) is a probabilistic generative framework that generates hypotheses about how objects articulate given input observations.
We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework.
We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
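As a rough illustration of the "Update" step in such a hypothesize-simulate-act-update loop, the sketch below reweights articulation hypotheses by how well their simulated predictions match the observed motion; the Gaussian observation model, `sigma`, and the toy hinge/slider numbers are assumptions rather than details from the paper.
```python
import numpy as np

def update_beliefs(beliefs, predicted_motions, observed_motion, sigma=0.05):
    """Bayesian reweighting of articulation hypotheses under a Gaussian observation model."""
    beliefs = np.asarray(beliefs, dtype=float)
    errors = np.array([np.linalg.norm(p - observed_motion) for p in predicted_motions])
    likelihoods = np.exp(-0.5 * (errors / sigma) ** 2)
    posterior = beliefs * likelihoods
    return posterior / posterior.sum()

# Two hypotheses (revolute hinge vs. prismatic slider) predict different handle
# displacements after the same push; the observation favors the hinge.
predicted = [np.array([0.00, 0.10, 0.02]),   # hinge prediction (m)
             np.array([0.10, 0.00, 0.00])]   # slider prediction (m)
observed = np.array([0.01, 0.09, 0.02])
print(update_beliefs([0.5, 0.5], predicted, observed))  # ~[0.96, 0.04]
```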
arXiv Detail & Related papers (2022-10-22T18:39:33Z)
- A Differentiable Recipe for Learning Visual Non-Prehensile Planar Manipulation [63.1610540170754]
We focus on the problem of visual non-prehensile planar manipulation.
We propose a novel architecture that combines video decoding neural models with priors from contact mechanics.
We find that our modular and fully differentiable architecture performs better than learning-only methods on unseen objects and motions.
arXiv Detail & Related papers (2021-11-09T18:39:45Z)
- Coarse-to-Fine Imitation Learning: Robot Manipulation from a Single Demonstration [8.57914821832517]
We introduce a simple new method for visual imitation learning, which allows a novel robot manipulation task to be learned from a single human demonstration.
Our method models imitation learning as a state estimation problem, with the state defined as the end-effector's pose.
At test time, the end-effector moves to the estimated state along a linear path, after which the end-effector velocities from the original demonstration are simply replayed.
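This test-time procedure admits a compact sketch: move to the estimated pose along a straight line, then replay the recorded velocities open-loop. Everything below (the controller interface `send_velocity`, the time step, the waypoint count) is an illustrative assumption, not the paper's implementation.
```python
import numpy as np

def linear_approach(start: np.ndarray, goal: np.ndarray, n_steps: int = 50) -> np.ndarray:
    """Straight-line waypoints from the current pose to the estimated bottleneck pose."""
    alphas = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - alphas) * start + alphas * goal

def run_test_time(start, estimated_pose, demo_velocities, send_velocity, dt=0.05):
    # Coarse phase: drive along a straight line to the pose estimated from the demo.
    waypoints = linear_approach(start, estimated_pose)
    for prev, nxt in zip(waypoints[:-1], waypoints[1:]):
        send_velocity((nxt - prev) / dt)
    # Fine phase: replay the demonstration's end-effector velocities open-loop.
    for v in demo_velocities:
        send_velocity(v)

# Toy usage with a dummy controller that just records the commanded velocities.
commands = []
run_test_time(np.zeros(3), np.array([0.3, 0.0, 0.2]),
              demo_velocities=[np.array([0.0, 0.0, -0.05])] * 10,
              send_velocity=commands.append)
```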
arXiv Detail & Related papers (2021-05-13T16:36:55Z)
- Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z)
- Learning Dexterous Grasping with Object-Centric Visual Affordances [86.49357517864937]
Dexterous robotic hands are appealing for their agility and human-like morphology.
We introduce an approach for learning dexterous grasping.
Our key idea is to embed an object-centric visual affordance model within a deep reinforcement learning loop.
arXiv Detail & Related papers (2020-09-03T04:00:40Z)
- Visual Prediction of Priors for Articulated Object Interaction [37.759459329701194]
Humans are able to build on prior experience quickly and efficiently.
Adults also exhibit this behavior when entering new spaces such as kitchens.
We develop a method, Contextual Prior Prediction, which provides a means of transferring knowledge between interactions in similar domains through vision.
arXiv Detail & Related papers (2020-06-06T21:17:03Z)
- Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)