Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
- URL: http://arxiv.org/abs/2209.05451v1
- Date: Mon, 12 Sep 2022 17:51:05 GMT
- Title: Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
- Authors: Mohit Shridhar, Lucas Manuelli, Dieter Fox
- Abstract summary: We develop PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation.
PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action".
Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.
- Score: 52.94101901600948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have revolutionized vision and natural language processing with
their ability to scale with large datasets. But in robotic manipulation, data
is both limited and expensive. Can we still benefit from Transformers with the
right problem formulation? We investigate this question with PerAct, a
language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation.
PerAct encodes language goals and RGB-D voxel observations with a Perceiver
Transformer, and outputs discretized actions by "detecting the next best voxel
action". Unlike frameworks that operate on 2D images, the voxelized observation
and action space provides a strong structural prior for efficiently learning
6-DoF policies. With this formulation, we train a single multi-task Transformer
for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18
variations) from just a few demonstrations per task. Our results show that
PerAct significantly outperforms unstructured image-to-action agents and 3D
ConvNet baselines for a wide range of tabletop tasks.
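The key formulation detail is that a continuous 6-DoF action is discretized: the target gripper position becomes the index of a voxel in the observation grid, rotation is binned per axis, and the gripper state is a binary class, so choosing the next action reduces to classification over the voxel grid ("detecting the next best voxel action"). The sketch below illustrates this kind of discretization; it is not the authors' code, and the grid resolution, rotation bin size, and workspace bounds are illustrative assumptions.

```python
# Minimal sketch (assumed values, not the paper's exact configuration):
# discretizing a 6-DoF gripper action into a voxel index plus rotation/gripper bins.
import numpy as np

VOXEL_GRID = 100          # assumed voxels per axis over the workspace
ROT_BINS = 72             # assumed 5-degree rotation bins per Euler axis
WORKSPACE_MIN = np.array([-0.5, -0.5, 0.0])   # hypothetical bounds (meters)
WORKSPACE_MAX = np.array([0.5, 0.5, 1.0])

def discretize_action(translation, euler_deg, gripper_open):
    """Map a continuous 6-DoF gripper pose to discrete classification targets."""
    # Translation -> index of the voxel containing the target gripper position.
    norm = (translation - WORKSPACE_MIN) / (WORKSPACE_MAX - WORKSPACE_MIN)
    voxel_idx = np.clip((norm * VOXEL_GRID).astype(int), 0, VOXEL_GRID - 1)
    # Rotation -> one bin per Euler axis.
    rot_idx = ((np.asarray(euler_deg) % 360) / (360 / ROT_BINS)).astype(int)
    # Gripper state -> binary class.
    return voxel_idx, rot_idx, int(gripper_open)

def decode_translation(voxel_idx):
    """Recover the center of a predicted voxel as a continuous position."""
    return WORKSPACE_MIN + (voxel_idx + 0.5) / VOXEL_GRID * (WORKSPACE_MAX - WORKSPACE_MIN)

# At inference, the policy scores every voxel and takes the argmax
# ("next best voxel"); a random stand-in replaces the network output here.
q_trans = np.random.rand(VOXEL_GRID, VOXEL_GRID, VOXEL_GRID)
best_voxel = np.unravel_index(q_trans.argmax(), q_trans.shape)
print(decode_translation(np.array(best_voxel)))
```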
Related papers
- Diffusion Transformer Policy: Scaling Diffusion Transformer for Generalist Vision-Language-Action Learning [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions.
By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z)
- Autoregressive Action Sequence Learning for Robotic Manipulation [32.9580007141312]
Existing autoregressive architectures generate end-effector waypoints sequentially, in the same way that language models generate word tokens.
We extend causal transformers' single-token prediction to support predicting a variable number of tokens in a single step.
We propose the Autoregressive Policy architecture, which solves manipulation tasks by generating hybrid action sequences.
arXiv Detail & Related papers (2024-10-04T04:07:15Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks [0.0]
In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation.
We propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%.
Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories.
arXiv Detail & Related papers (2024-04-02T13:25:16Z)
- Imitating Task and Motion Planning with Visuomotor Transformers [71.41938181838124]
Task and Motion Planning (TAMP) can autonomously generate large-scale datasets of diverse demonstrations.
In this work, we show that the combination of large-scale datasets generated by TAMP supervisors and flexible Transformer models to fit them is a powerful paradigm for robot manipulation.
We present a novel imitation learning system called OPTIMUS that trains large-scale visuomotor Transformer policies by imitating a TAMP agent.
arXiv Detail & Related papers (2023-05-25T17:58:14Z)
- Instruction-driven history-aware policies for robotic manipulations [82.25511767738224]
We propose a unified transformer-based approach that takes into account multiple inputs.
In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations.
We evaluate our method on the challenging RLBench benchmark and on a real-world robot.
arXiv Detail & Related papers (2022-09-11T16:28:25Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)