Related papers: Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation

Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation

URL: http://arxiv.org/abs/2405.01527v2
Date: Thu, 8 Aug 2024 23:18:08 GMT
Title: Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation
Authors: Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, Shubham Tulsiani,
Abstract summary: We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation. Our framework,Track2Act predicts tracks of how points in an image should move in future time-steps based on a goal. We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
Score: 65.46610405509338
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation: interacting with unseen objects in novel scenes without test-time adaptation. While typical approaches rely on a large amount of demonstration data for such generalization, we propose an approach that leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world. Our framework,Track2Act predicts tracks of how points in an image should move in future time-steps based on a goal, and can be trained with diverse videos on the web including those of humans and robots manipulating everyday objects. We use these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses that can be executed in an open-loop manner. We then refine this open-loop plan by predicting residual actions through a closed loop policy trained with a few embodiment-specific demonstrations. We show that this approach of combining scalably learned track prediction with a residual policy requiring minimal in-domain robot-specific data enables diverse generalizable robot manipulation, and present a wide array of real-world robot manipulation results across unseen tasks, objects, and scenes. https://homangab.github.io/track2act/

Related papers

Vidar: Embodied Video Diffusion Model for Generalist Bimanual Manipulation [21.424029706788883]
We introduce Video Diffusion for Action Reasoning (Vidar)<n>We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms.<n>With only 20 minutes of human demonstrations on an unseen robot platform, Vidar generalizes to unseen tasks and backgrounds with strong semantic understanding.
arXiv Detail & Related papers (2025-07-17T08:31:55Z)
3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model [40.730112146035076]
A key reason is the lack of a large and uniform dataset for teaching robots manipulation skills.<n>Current robot datasets often record robot action in different action spaces within a simple scene.<n>We learn a 3D flow world model from both human and robot manipulation data.
arXiv Detail & Related papers (2025-06-06T16:00:31Z)
ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos [15.809468471562537]
ZeroMimic generates image goal-conditioned skill policies for several common manipulation tasks. We evaluate ZeroMimic's out-of-the-box performance in varied real-world and simulated kitchen settings. To enable plug-and-play reuse of ZeroMimic policies on other task setups and robots, we release software and policy checkpoints.
arXiv Detail & Related papers (2025-03-31T09:27:00Z)
VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation [74.70013315714336]
Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.
arXiv Detail & Related papers (2024-09-24T17:57:33Z)
Vision-based Manipulation from Single Human Video with Open-World Object Graphs [58.23098483464538]
We present an object-centric approach to empower robots to learn vision-based manipulation skills from human videos. We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB-D video.
arXiv Detail & Related papers (2024-05-30T17:56:54Z)
Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans [58.27029676638521]
We show how passive human videos can serve as a rich source of data for learning such generalist robots. We learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations. We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects.
arXiv Detail & Related papers (2023-12-01T18:54:12Z)
Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned Policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills. We learn our policy to generate appropriate actions given current scene observations and a video of the target task. We validate our approach on a set of challenging multi-task robot manipulation environments and outperform state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z)
Zero-Shot Robot Manipulation from Passive Human Videos [59.193076151832145]
We develop a framework for extracting agent-agnostic action representations from human videos. Our framework is based on predicting plausible human hand trajectories. We deploy the trained model zero-shot for physical robot manipulation tasks.
arXiv Detail & Related papers (2023-02-03T21:39:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.