Transformers for One-Shot Visual Imitation
- URL: http://arxiv.org/abs/2011.05970v1
- Date: Wed, 11 Nov 2020 18:41:07 GMT
- Title: Transformers for One-Shot Visual Imitation
- Authors: Sudeep Dasari, Abhinav Gupta
- Abstract summary: Humans are able to seamlessly visually imitate others by inferring their intentions and using past experience to achieve the same end goal.
Prior research in robot imitation learning has created agents which can acquire diverse skills from expert human operators.
This paper investigates techniques which allow robots to partially bridge these domain gaps, using their past experience.
- Score: 28.69615089950047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans are able to seamlessly visually imitate others by inferring their intentions and using past experience to achieve the same end goal. In other words, we can parse complex semantic knowledge from raw video and efficiently translate that into concrete motor control. Is it possible to give a robot this same capability? Prior research in robot imitation learning has created agents which can acquire diverse skills from expert human operators. However, expanding these techniques to work with a single positive example during test time is still an open challenge. Apart from control, the difficulty stems from mismatches between the demonstrator and robot domains. For example, objects may be placed in different locations (e.g. kitchen layouts are different in every house). Additionally, the demonstration may come from an agent with a different morphology and physical appearance (e.g. a human), so one-to-one action correspondences are not available. This paper investigates techniques which allow robots to partially bridge these domain gaps using their past experience. A neural network is trained to mimic ground-truth robot actions given a context video from another agent, and must generalize to unseen task instances when prompted with new videos during test time. We hypothesize that our policy representations must be both context driven and dynamics aware in order to perform these tasks. These assumptions are baked into the neural network using the Transformer attention mechanism and a self-supervised inverse dynamics loss. Finally, we experimentally determine that our method accomplishes a $\sim 2\times$ improvement in task success rate over prior baselines on a suite of one-shot manipulation tasks.
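The abstract describes a policy that attends over a demonstration video with a Transformer and is regularized with a self-supervised inverse dynamics objective. The snippet below is a minimal sketch of that idea, not the authors' released implementation: the module names, feature dimensions, action dimensionality, and loss weighting are illustrative assumptions.
```python
# Hypothetical sketch (not the paper's code): a Transformer policy that attends
# over demonstration-context features and is trained with a behavior-cloning
# loss plus an inverse-dynamics loss whose labels come from the robot's own
# logged actions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneShotImitationPolicy(nn.Module):
    def __init__(self, feat_dim=256, action_dim=7, n_heads=4, n_layers=2):
        super().__init__()
        # Assumed: per-frame visual features (e.g. from a CNN) of size feat_dim.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(feat_dim, action_dim)       # pi(a_t | o_t, context)
        self.inv_dyn_head = nn.Linear(2 * feat_dim, action_dim)  # a_t from (o_t, o_{t+1})

    def forward(self, context_feats, obs_feats):
        # context_feats: (B, T_ctx, feat_dim) features of the demonstrator video
        # obs_feats:     (B, T_obs, feat_dim) features of the robot's observations
        tokens = torch.cat([context_feats, obs_feats], dim=1)
        fused = self.attn(tokens)                     # attention mixes context and state
        obs_fused = fused[:, context_feats.size(1):]  # keep robot-observation tokens
        actions = self.action_head(obs_fused)
        # Inverse dynamics: predict the action linking consecutive observations.
        pairs = torch.cat([obs_fused[:, :-1], obs_fused[:, 1:]], dim=-1)
        inv_actions = self.inv_dyn_head(pairs)
        return actions, inv_actions

def loss_fn(actions, inv_actions, robot_actions, lam=1.0):
    # Behavior cloning on ground-truth robot actions plus the self-supervised
    # inverse-dynamics term; lam is an assumed trade-off weight.
    bc = F.mse_loss(actions, robot_actions)
    inv = F.mse_loss(inv_actions, robot_actions[:, :-1])
    return bc + lam * inv
```
At test time the same forward pass is run with a new demonstration video supplied as the context features, which is what makes the policy one-shot.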
Related papers
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z) - XSkill: Cross Embodiment Skill Discovery [41.624343257852146]
XSkill is an imitation learning framework that discovers a cross-embodiment representation called skill prototypes purely from unlabeled human and robot manipulation videos.
Our experiments in simulation and real-world environments show that the discovered skill prototypes facilitate skill transfer and composition for unseen tasks.
arXiv Detail & Related papers (2023-07-19T12:51:28Z) - Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned Policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills.
We learn our policy to generate appropriate actions given current scene observations and a video of the target task.
We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z) - Affordances from Human Videos as a Versatile Representation for Robotics [31.248842798600606]
We train a visual affordance model that estimates where and how in the scene a human is likely to interact.
The structure of these behavioral affordances directly enables the robot to perform many complex tasks.
We show the efficacy of our approach, which we call VRB, across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
arXiv Detail & Related papers (2023-04-17T17:59:34Z) - Zero-Shot Robot Manipulation from Passive Human Videos [59.193076151832145]
We develop a framework for extracting agent-agnostic action representations from human videos.
Our framework is based on predicting plausible human hand trajectories.
We deploy the trained model zero-shot for physical robot manipulation tasks.
arXiv Detail & Related papers (2023-02-03T21:39:52Z) - Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective (see the sketch after this list).
arXiv Detail & Related papers (2022-11-16T16:26:48Z) - A Differentiable Recipe for Learning Visual Non-Prehensile Planar Manipulation [63.1610540170754]
We focus on the problem of visual non-prehensile planar manipulation.
We propose a novel architecture that combines video decoding neural models with priors from contact mechanics.
We find that our modular and fully differentiable architecture performs better than learning-only methods on unseen objects and motions.
arXiv Detail & Related papers (2021-11-09T18:39:45Z) - Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
arXiv Detail & Related papers (2021-03-31T05:25:05Z)
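For intuition on the embedding-distance rewards mentioned in the "Learning Reward Functions for Robotic Manipulation by Observing Humans" entry above, here is a minimal sketch, assuming a frozen encoder that was pretrained with a time-contrastive objective; the function name, tensor shapes, and the choice of Euclidean distance are illustrative assumptions rather than that paper's released code.
```python
# Hypothetical sketch: reward as negative distance to a goal image in an
# embedding space (encoder assumed pretrained with a time-contrastive objective).
import torch
import torch.nn as nn

def embedding_distance_reward(encoder: nn.Module,
                              obs_image: torch.Tensor,
                              goal_image: torch.Tensor) -> torch.Tensor:
    """Return a per-sample reward: -||phi(o_t) - phi(o_goal)||."""
    with torch.no_grad():              # encoder is frozen when used as a reward
        z_obs = encoder(obs_image)     # (B, D) embedding of the current frame
        z_goal = encoder(goal_image)   # (B, D) embedding of the goal frame
    return -torch.norm(z_obs - z_goal, dim=-1)
```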