PLEX: Making the Most of the Available Data for Robotic Manipulation Pretraining
- URL: http://arxiv.org/abs/2303.08789v2
- Date: Wed, 8 Nov 2023 22:32:39 GMT
- Title: PLEX: Making the Most of the Available Data for Robotic Manipulation Pretraining
- Authors: Garrett Thomas, Ching-An Cheng, Ricky Loynd, Felipe Vieira Frujeri, Vibhav Vineet, Mihai Jalobeanu, Andrey Kolobov
- Abstract summary: We propose a transformer-based architecture that learns from a small amount of task-agnostic visuomotor trajectories.
In particular, using relative positional encoding in PLEX's transformers greatly helps in low-data regimes of learning from human-collected demonstrations.
- Score: 28.504762473732296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A rich representation is key to general robotic manipulation, but existing
approaches to representation learning require large amounts of multimodal
demonstrations. In this work we propose PLEX, a transformer-based architecture
that learns from a small amount of task-agnostic visuomotor trajectories and a
much larger amount of task-conditioned object manipulation videos -- a type of
data available in quantity. PLEX uses visuomotor trajectories to induce a
latent feature space and to learn task-agnostic manipulation routines, while
diverse video-only demonstrations teach PLEX how to plan in the induced latent
feature space for a wide variety of tasks. Experiments showcase PLEX's
generalization on Meta-World and SOTA performance in challenging Robosuite
environments. In particular, using relative positional encoding in PLEX's
transformers greatly helps in low-data regimes of learning from human-collected
demonstrations. The paper's accompanying code and data are available at
https://microsoft.github.io/PLEX.
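To make the abstract's planner/executor split concrete, below is a minimal sketch of how such a two-component model could be wired up, assuming PyTorch. The class names, dimensions, the MLP executor, and the absolute positional embeddings are illustrative assumptions only, not the authors' implementation (the paper itself reports that relative positional encoding works better in low-data regimes); see https://microsoft.github.io/PLEX for the actual code and data.

```python
# Illustrative sketch (assumed, not the authors' code): a video-trained latent-space
# planner plus a trajectory-trained executor, as described in the PLEX abstract.
import torch
import torch.nn as nn


def causal_mask(T: int) -> torch.Tensor:
    # Upper-triangular -inf mask so each step attends only to the past.
    return torch.triu(torch.full((T, T), float("-inf")), diagonal=1)


class Planner(nn.Module):
    """Task-conditioned planner: predicts future observation embeddings.

    Because it never consumes actions, it can be trained on video-only
    demonstrations, the data type the abstract says is available in quantity.
    """

    def __init__(self, obs_dim: int = 256, task_dim: int = 256, max_len: int = 64):
        super().__init__()
        self.task_proj = nn.Linear(task_dim, obs_dim)
        # Learned absolute positions for brevity; the paper reports relative
        # positional encoding helps in low-data regimes.
        self.pos = nn.Embedding(max_len, obs_dim)
        layer = nn.TransformerEncoderLayer(obs_dim, nhead=4,
                                           dim_feedforward=4 * obs_dim,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(obs_dim, obs_dim)

    def forward(self, obs_emb: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
        # obs_emb: (B, T, obs_dim) image embeddings; task_emb: (B, task_dim) goal/instruction.
        B, T, _ = obs_emb.shape
        pos = self.pos(torch.arange(T, device=obs_emb.device))
        x = obs_emb + self.task_proj(task_emb).unsqueeze(1) + pos
        h = self.backbone(x, mask=causal_mask(T).to(obs_emb.device))
        return self.head(h)  # predicted future latent observations


class Executor(nn.Module):
    """Task-agnostic executor: (current latent, planned target latent) -> action.

    Trained on the small set of visuomotor trajectories that do carry actions.
    """

    def __init__(self, obs_dim: int = 256, act_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * obs_dim, 512), nn.ReLU(),
                                 nn.Linear(512, act_dim))

    def forward(self, cur: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([cur, target], dim=-1))


if __name__ == "__main__":
    planner, executor = Planner(), Executor()
    obs = torch.randn(2, 16, 256)    # embeddings from some visual encoder
    task = torch.randn(2, 256)       # embedding of a goal image or instruction
    plan = planner(obs, task)        # (2, 16, 256)
    action = executor(obs[:, -1], plan[:, -1])
    print(action.shape)              # torch.Size([2, 7])
```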
Related papers
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM finetuned with a limited amount of such data can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments demonstrate that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z)
- JUICER: Data-Efficient Imitation Learning for Robotic Assembly [21.43402768760014]
This paper proposes a pipeline for improving imitation learning performance with a small human demonstration budget.
Our pipeline combines expressive policy architectures and various techniques for dataset expansion and simulation-based data augmentation.
We demonstrate our pipeline on four furniture assembly tasks in simulation, enabling a manipulator to assemble up to five parts over nearly 2500 time steps.
arXiv Detail & Related papers (2024-04-04T18:00:15Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
- Imitating Task and Motion Planning with Visuomotor Transformers [71.41938181838124]
Task and Motion Planning (TAMP) can autonomously generate large-scale datasets of diverse demonstrations.
In this work, we show that the combination of large-scale datasets generated by TAMP supervisors and flexible Transformer models to fit them is a powerful paradigm for robot manipulation.
We present a novel imitation learning system called OPTIMUS that trains large-scale visuomotor Transformer policies by imitating a TAMP agent.
arXiv Detail & Related papers (2023-05-25T17:58:14Z)
- VIMA: General Robot Manipulation with Multimodal Prompts [82.01214865117637]
We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts.
We develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks.
We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively.
arXiv Detail & Related papers (2022-10-06T17:50:11Z)
- Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation [52.94101901600948]
We develop PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation.
PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action".
Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks (a toy sketch of this voxel-scoring idea appears after this list).
arXiv Detail & Related papers (2022-09-12T17:51:05Z)
- Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
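As referenced in the PerAct entry above, "detecting the next best voxel action" can be pictured as scoring every cell of a workspace voxel grid and taking the argmax as the target gripper position, with separate heads for discretized rotation and gripper state. The tiny 3D CNN below is only an assumed stand-in (PerAct itself uses a Perceiver Transformer over language and RGB-D voxel tokens); all names, channel counts, and grid sizes are illustrative.

```python
# Toy stand-in (assumed) for the "next best voxel action" idea: score every voxel,
# argmax gives the target position; auxiliary heads give rotation and gripper state.
import torch
import torch.nn as nn


class NextBestVoxel(nn.Module):
    def __init__(self, in_channels: int = 10, rot_bins: int = 72):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.trans_head = nn.Conv3d(32, 1, kernel_size=1)  # one logit per voxel
        self.rot_head = nn.Linear(32, 3 * rot_bins)        # discretized Euler angles
        self.grip_head = nn.Linear(32, 2)                  # open / close

    def forward(self, voxels: torch.Tensor):
        # voxels: (B, C, D, H, W) fused RGB-D features on a workspace voxel grid.
        feat = self.trunk(voxels)
        trans_logits = self.trans_head(feat).flatten(1)    # (B, D*H*W)
        pooled = feat.mean(dim=(2, 3, 4))                  # global context for the other heads
        return trans_logits, self.rot_head(pooled), self.grip_head(pooled)


if __name__ == "__main__":
    model = NextBestVoxel()
    vox = torch.randn(1, 10, 32, 32, 32)
    trans_logits, rot_logits, grip_logits = model(vox)
    idx = int(trans_logits.argmax(dim=1))                   # index of the "next best voxel"
    z, y, x = idx // (32 * 32), (idx // 32) % 32, idx % 32  # back to grid coordinates
    print((x, y, z), rot_logits.shape, grip_logits.shape)
```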