Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
- URL: http://arxiv.org/abs/2209.05451v1
- Date: Mon, 12 Sep 2022 17:51:05 GMT
- Title: Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
- Authors: Mohit Shridhar, Lucas Manuelli, Dieter Fox
- Abstract summary: We develop PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation.
PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action".
Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.
- Score: 52.94101901600948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have revolutionized vision and natural language processing with
their ability to scale with large datasets. But in robotic manipulation, data
is both limited and expensive. Can we still benefit from Transformers with the
right problem formulation? We investigate this question with PerAct, a
language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation.
PerAct encodes language goals and RGB-D voxel observations with a Perceiver
Transformer, and outputs discretized actions by "detecting the next best voxel
action". Unlike frameworks that operate on 2D images, the voxelized observation
and action space provides a strong structural prior for efficiently learning
6-DoF policies. With this formulation, we train a single multi-task Transformer
for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18
variations) from just a few demonstrations per task. Our results show that
PerAct significantly outperforms unstructured image-to-action agents and 3D
ConvNet baselines for a wide range of tabletop tasks.
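The key formulation detail is that a continuous 6-DoF action is discretized: the target gripper position becomes the index of a voxel in the observation grid, rotation is binned per axis, and the gripper state is a binary class, so choosing the next action reduces to classification over the voxel grid ("detecting the next best voxel action"). The sketch below illustrates this kind of discretization; it is not the authors' code, and the grid resolution, rotation bin size, and workspace bounds are illustrative assumptions.

```python
# Minimal sketch (assumed values, not the paper's exact configuration):
# discretizing a 6-DoF gripper action into a voxel index plus rotation/gripper bins.
import numpy as np

VOXEL_GRID = 100          # assumed voxels per axis over the workspace
ROT_BINS = 72             # assumed 5-degree rotation bins per Euler axis
WORKSPACE_MIN = np.array([-0.5, -0.5, 0.0])   # hypothetical bounds (meters)
WORKSPACE_MAX = np.array([0.5, 0.5, 1.0])

def discretize_action(translation, euler_deg, gripper_open):
    """Map a continuous 6-DoF gripper pose to discrete classification targets."""
    # Translation -> index of the voxel containing the target gripper position.
    norm = (translation - WORKSPACE_MIN) / (WORKSPACE_MAX - WORKSPACE_MIN)
    voxel_idx = np.clip((norm * VOXEL_GRID).astype(int), 0, VOXEL_GRID - 1)
    # Rotation -> one bin per Euler axis.
    rot_idx = ((np.asarray(euler_deg) % 360) / (360 / ROT_BINS)).astype(int)
    # Gripper state -> binary class.
    return voxel_idx, rot_idx, int(gripper_open)

def decode_translation(voxel_idx):
    """Recover the center of a predicted voxel as a continuous position."""
    return WORKSPACE_MIN + (voxel_idx + 0.5) / VOXEL_GRID * (WORKSPACE_MAX - WORKSPACE_MIN)

# At inference, the policy scores every voxel and takes the argmax
# ("next best voxel"); a random stand-in replaces the network output here.
q_trans = np.random.rand(VOXEL_GRID, VOXEL_GRID, VOXEL_GRID)
best_voxel = np.unravel_index(q_trans.argmax(), q_trans.shape)
print(decode_translation(np.array(best_voxel)))
```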
Related papers
- Diffusion Transformer Policy: Scaling Diffusion Transformer for Generalist Vision-Language-Action Learning [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions.
By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z)
- Autoregressive Action Sequence Learning for Robotic Manipulation [32.9580007141312]
Existing autoregressive architectures generate end-effector waypoints sequentially, in the same way that language models generate word tokens.
We extend causal transformers' single-token prediction to support predicting a variable number of tokens in a single step.
We propose the Autoregressive Policy architecture, which solves manipulation tasks by generating hybrid action sequences.
arXiv Detail & Related papers (2024-10-04T04:07:15Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks [0.0]
In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation.
We propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%.
Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories.
arXiv Detail & Related papers (2024-04-02T13:25:16Z)
- Imitating Task and Motion Planning with Visuomotor Transformers [71.41938181838124]
Task and Motion Planning (TAMP) can autonomously generate large-scale datasets of diverse demonstrations.
In this work, we show that the combination of large-scale datasets generated by TAMP supervisors and flexible Transformer models to fit them is a powerful paradigm for robot manipulation.
We present a novel imitation learning system called OPTIMUS that trains large-scale visuomotor Transformer policies by imitating a TAMP agent.
arXiv Detail & Related papers (2023-05-25T17:58:14Z)
- Instruction-driven history-aware policies for robotic manipulations [82.25511767738224]
We propose a unified transformer-based approach that takes into account multiple inputs.
In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations.
We evaluate our method on the challenging RLBench benchmark and on a real-world robot.
arXiv Detail & Related papers (2022-09-11T16:28:25Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)