OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation
- URL: http://arxiv.org/abs/2410.11792v1
- Date: Tue, 15 Oct 2024 17:17:54 GMT
- Title: OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation
- Authors: Jinhan Li, Yifeng Zhu, Yuqi Xie, Zhenyu Jiang, Mingyo Seo, Georgios Pavlakos, Yuke Zhu
- Abstract summary: We introduce OKAMI, a method that generates a manipulation plan from a single RGB-D video.
OKAMI uses open-world vision models to identify task-relevant objects and retarget the body motions and hand poses separately.
- Abstract: We study the problem of teaching humanoid robots manipulation skills by imitating single video demonstrations. We introduce OKAMI, a method that generates a manipulation plan from a single RGB-D video and derives a policy for execution. At the heart of our approach is object-aware retargeting, which enables the humanoid robot to mimic the human motions in an RGB-D video while adjusting to different object locations during deployment. OKAMI uses open-world vision models to identify task-relevant objects and retarget the body motions and hand poses separately. Our experiments show that OKAMI achieves strong generalization across varying visual and spatial conditions, outperforming the state-of-the-art baseline on open-world imitation from observation. Furthermore, OKAMI rollout trajectories are leveraged to train closed-loop visuomotor policies, which achieve an average success rate of 79.2% without the need for labor-intensive teleoperation. More videos can be found on our website https://ut-austin-rpl.github.io/OKAMI/.
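The core idea named in the abstract, object-aware retargeting, amounts to expressing the demonstrated motion relative to each task-relevant object and replaying it against the object's location detected at deployment. The sketch below illustrates only that spatial-warping step with a simple translation-only warp in Python; the names and numbers are illustrative assumptions, not OKAMI's actual implementation, which additionally retargets body motions and hand poses separately to the humanoid's kinematics.

```python
# Minimal, self-contained sketch of the object-aware warping idea: a reference
# trajectory recorded in the demo is re-expressed in the object's frame, then
# replayed against the object's pose observed at deployment. Translation-only
# warp and all values are illustrative assumptions, not OKAMI's actual method.
import numpy as np

def warp_trajectory(ref_traj: np.ndarray,
                    ref_object_pos: np.ndarray,
                    new_object_pos: np.ndarray) -> np.ndarray:
    """Shift a (T, 3) reference trajectory so its waypoints keep the same
    relative offsets to the object at the object's new location."""
    relative = ref_traj - ref_object_pos   # trajectory in the object's frame
    return relative + new_object_pos       # back to the world frame at test time

if __name__ == "__main__":
    # Reference reach from the human demo (e.g. hand keypoints, in meters).
    ref_traj = np.array([[0.20, 0.00, 0.30],
                         [0.35, 0.05, 0.25],
                         [0.50, 0.10, 0.20]])
    ref_object_pos = np.array([0.50, 0.10, 0.20])   # object location in the demo
    new_object_pos = np.array([0.45, -0.15, 0.20])  # location detected at deployment

    warped = warp_trajectory(ref_traj, ref_object_pos, new_object_pos)
    print(warped)  # the final waypoint now coincides with the new object position
```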