Human-to-Robot Imitation in the Wild
- URL: http://arxiv.org/abs/2207.09450v1
- Date: Tue, 19 Jul 2022 17:59:59 GMT
- Title: Human-to-Robot Imitation in the Wild
- Authors: Shikhar Bahl, Abhinav Gupta, Deepak Pathak
- Abstract summary: We propose an efficient one-shot robot learning algorithm, centered around learning from a third-person perspective.
We show one-shot generalization and success in real-world settings, including 20 different manipulation tasks in the wild.
- Score: 50.49660984318492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We approach the problem of learning by watching humans in the wild. While
traditional approaches in Imitation and Reinforcement Learning are promising
for learning in the real world, they are either sample inefficient or are
constrained to lab settings. Meanwhile, there has been a lot of success in
processing passive, unstructured human data. We propose tackling this problem
via an efficient one-shot robot learning algorithm, centered around learning
from a third-person perspective. We call our method WHIRL: In-the-Wild Human
Imitating Robot Learning. WHIRL extracts a prior over the intent of the human
demonstrator, using it to initialize our agent's policy. We introduce an
efficient real-world policy learning scheme that improves using interactions.
Our key contributions are a simple sampling-based policy optimization approach,
a novel objective function for aligning human and robot videos, and an
exploration method to boost sample efficiency. We show one-shot generalization
and success in real-world settings, including 20 different manipulation tasks
in the wild. Videos and talk at https://human2robot.github.io
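The contributions listed above (a prior over the demonstrator's intent used to initialize the policy, a sampling-based policy optimizer, and an objective that aligns human and robot videos) can be pictured with a short sketch. The snippet below is a hedged illustration, not the authors' implementation: the CEM-style loop, the `embed_video` and `execute_on_robot` helpers, and all hyperparameters are assumptions chosen only to make the idea concrete.
```python
# Minimal sketch (assumptions, not WHIRL's code) of sampling-based policy
# optimization guided by a human-robot video alignment cost.
import numpy as np

def align_cost(robot_video, human_video, embed_video):
    """Lower is better: distance between video embeddings, standing in for
    the human-robot video alignment objective named in the abstract."""
    return float(np.linalg.norm(embed_video(robot_video) - embed_video(human_video)))

def cem_policy_search(init_mean, init_std, human_video,
                      execute_on_robot, embed_video,
                      n_samples=20, n_elites=5, n_iters=10):
    """Sampling-based policy optimization, initialized from a prior over the
    human demonstrator's intent (init_mean, init_std)."""
    mean = np.asarray(init_mean, dtype=float)
    std = np.asarray(init_std, dtype=float)
    for _ in range(n_iters):
        # Sample candidate policy parameters around the current estimate.
        params = mean + std * np.random.randn(n_samples, mean.size)
        # Roll each candidate out on the robot and score the resulting video
        # against the single human demonstration.
        costs = np.array([align_cost(execute_on_robot(p), human_video, embed_video)
                          for p in params])
        # Refit the sampling distribution to the lowest-cost (elite) samples.
        elites = params[np.argsort(costs)[:n_elites]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean
```
In this sketch the prior over the human's intent is simply the initial Gaussian (`init_mean`, `init_std`), and the alignment objective is stood in for by a distance between video embeddings; the paper's actual objective and exploration method are more involved.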
Related papers
- Learning Strategies For Successful Crowd Navigation [0.0]
We focus on crowd navigation, using a neural network to learn specific strategies in-situ with a robot.
A CNN takes a top-down image of the scene as input and outputs the next action for the robot to take in terms of speed and angle.
arXiv Detail & Related papers (2024-04-09T18:25:21Z)
- Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z)
- Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans [58.27029676638521]
We show how passive human videos can serve as a rich source of data for learning generalist robot manipulation policies.
We learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations.
We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects.
arXiv Detail & Related papers (2023-12-01T18:54:12Z)
- Structured World Models from Human Videos [45.08503470821952]
We tackle the problem of learning complex, general behaviors directly in the real world.
We propose an approach for robots to efficiently learn manipulation skills using only a handful of real-world interaction trajectories.
arXiv Detail & Related papers (2023-08-21T17:59:32Z)
- Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations [66.47064743686953]
Eye-in-hand cameras have shown promise in enabling greater sample efficiency and generalization in vision-based robotic manipulation.
Videos of humans performing tasks, on the other hand, are much cheaper to collect since they eliminate the need for expertise in robotic teleoperation.
In this work, we augment narrow robotic imitation datasets with broad unlabeled human video demonstrations to greatly enhance the generalization of eye-in-hand visuomotor policies.
arXiv Detail & Related papers (2023-07-12T07:04:53Z)
- Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills.
We learn our policy to generate appropriate actions given current scene observations and a video of the target task.
We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z)
- Affordances from Human Videos as a Versatile Representation for Robotics [31.248842798600606]
We train a visual affordance model that estimates where and how in the scene a human is likely to interact.
The structure of these behavioral affordances directly enables the robot to perform many complex tasks.
We show the efficacy of our approach, which we call VRB, across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
arXiv Detail & Related papers (2023-04-17T17:59:34Z)
- Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective (see the sketch after this list).
arXiv Detail & Related papers (2022-11-16T16:26:48Z)
- Inducing Structure in Reward Learning by Learning Features [31.413656752926208]
We introduce a novel type of human input for teaching features and an algorithm that utilizes it to learn complex features from the raw state space.
We demonstrate our method in settings where all features have to be learned from scratch, as well as where some of the features are known.
arXiv Detail & Related papers (2022-01-18T16:02:29Z)
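The reward construction summarized in "Learning Reward Functions for Robotic Manipulation by Observing Humans" above reduces to a distance to a goal in a learned embedding space. The following is a minimal sketch under assumptions: `embed` stands in for an encoder pretrained with a time-contrastive objective, and `obs_img` / `goal_img` are the current and goal images; none of these names come from the paper.
```python
# Hedged sketch (not the paper's code): a task-agnostic reward defined as the
# negative distance to a goal image in an embedding space assumed to have been
# pretrained with a time-contrastive objective. `embed` is a hypothetical
# encoder mapping an image to a feature vector.
import numpy as np

def embedding_reward(obs_img, goal_img, embed):
    """Reward = -||phi(obs) - phi(goal)|| in the learned embedding space."""
    return -float(np.linalg.norm(embed(obs_img) - embed(goal_img)))
```
Such a reward could then drive standard policy optimization, for example the sampling-based loop sketched after the abstract above.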