Body Transformer: Leveraging Robot Embodiment for Policy Learning
- URL: http://arxiv.org/abs/2408.06316v1
- Date: Mon, 12 Aug 2024 17:31:28 GMT
- Title: Body Transformer: Leveraging Robot Embodiment for Policy Learning
- Authors: Carmelo Sferrazza, Dun-Ming Huang, Fangchen Liu, Jongmin Lee, Pieter Abbeel
- Abstract summary: Body Transformer (BoT) is an architecture that leverages the robot embodiment by providing an inductive bias that guides the learning process.
We represent the robot body as a graph of sensors and actuators, and rely on masked attention to pool information throughout the architecture.
The resulting architecture outperforms the vanilla transformer, as well as the classical multilayer perceptron, in terms of task completion, scaling properties, and computational efficiency.
- Score: 51.531793239586165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the transformer architecture has become the de facto standard for machine learning algorithms applied to natural language processing and computer vision. Despite notable evidence of successful deployment of this architecture in the context of robot learning, we claim that vanilla transformers do not fully exploit the structure of the robot learning problem. Therefore, we propose Body Transformer (BoT), an architecture that leverages the robot embodiment by providing an inductive bias that guides the learning process. We represent the robot body as a graph of sensors and actuators, and rely on masked attention to pool information throughout the architecture. The resulting architecture outperforms the vanilla transformer, as well as the classical multilayer perceptron, in terms of task completion, scaling properties, and computational efficiency when representing either imitation or reinforcement learning policies. Additional material including the open-source code is available at https://sferrazza.cc/bot_site.
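The key mechanism here is masked attention derived from the body graph: each token (one per sensor or actuator node) attends only to itself and its neighbors, i.e. $\mathrm{softmax}(QK^\top/\sqrt{d} + M)V$ with $M_{ij} = 0$ when nodes $i$ and $j$ are connected (or $i = j$) and $-\infty$ otherwise. Below is a minimal sketch of this idea, not the authors' released code (see the project site above); the chain topology, dimensions, and helper names are illustrative assumptions.

```python
# Sketch of body-graph masked attention (illustrative, not the BoT release).
import torch
import torch.nn.functional as F

def build_body_mask(adjacency: torch.Tensor) -> torch.Tensor:
    """Turn an (N, N) 0/1 adjacency matrix into an additive attention mask:
    0.0 where attention is allowed (self and neighbors), -inf elsewhere."""
    allowed = adjacency.bool() | torch.eye(adjacency.shape[0], dtype=torch.bool)
    mask = torch.full_like(adjacency, float("-inf"))
    mask[allowed] = 0.0
    return mask

def masked_attention(x, mask, w_q, w_k, w_v):
    """Single-head scaled dot-product attention restricted by the body mask.
    x: (N, d) with one embedding per sensor/actuator node."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5 + mask  # -inf blocks non-neighbors
    return F.softmax(scores, dim=-1) @ v

# Toy 4-node chain: base - torso - arm - gripper (assumed topology).
adjacency = torch.tensor([[0., 1., 0., 0.],
                          [1., 0., 1., 0.],
                          [0., 1., 0., 1.],
                          [0., 0., 1., 0.]])
mask = build_body_mask(adjacency)
d = 8
x = torch.randn(4, d)  # one token per body node
w_q, w_k, w_v = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
print(masked_attention(x, mask, w_q, w_k, w_v).shape)  # torch.Size([4, 8])
```

Stacking several such layers lets information travel one hop per layer along the graph, which is presumably how the architecture pools information across the whole body while keeping each layer's attention local.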
Related papers
- $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate our model in terms of its ability to perform tasks zero-shot after pre-training, to follow language instructions from people, and to acquire new skills via fine-tuning.
arXiv Detail & Related papers (2024-10-31T17:22:30Z) - The Ingredients for Robotic Diffusion Transformers [47.61690903645525]
We identify, study and improve key architectural design decisions for high-capacity diffusion transformer policies.
The resulting models can efficiently solve diverse tasks on multiple robot embodiments.
We find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language-annotated ALOHA demonstration data.
arXiv Detail & Related papers (2024-10-14T02:02:54Z) - RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation [77.41969287400977]
This paper presents RobotScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a code generation benchmark for robot manipulation tasks specified in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z) - RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation [33.10577695383743]
We propose a multi-embodiment, multi-task generalist agent for robotic manipulation called RoboCat.
The agent is trained on data spanning a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions.
With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot and through adaptation using only 100-1000 examples.
arXiv Detail & Related papers (2023-06-20T17:35:20Z) - RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z) - Instruction-driven history-aware policies for robotic manipulations [82.25511767738224]
We propose a unified transformer-based approach that takes into account multiple inputs.
In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations.
We evaluate our method on the challenging RLBench benchmark and on a real-world robot.
arXiv Detail & Related papers (2022-09-11T16:28:25Z) - What Matters in Language Conditioned Robotic Imitation Learning [26.92329260907805]
We study the most critical challenges in learning language-conditioned policies from offline free-form imitation datasets.
We present a novel approach that significantly outperforms the state of the art on the challenging language-conditioned, long-horizon CALVIN robot manipulation benchmark.
arXiv Detail & Related papers (2022-04-13T08:45:32Z) - MetaMorph: Learning Universal Controllers with Transformers [45.478223199658785]
In robotics, we primarily train a single robot for a single task.
However, modular robot systems now allow for the flexible combination of general-purpose building blocks into task-optimized morphologies.
We propose MetaMorph, a Transformer based approach to learn a universal controller over a modular robot design space.
arXiv Detail & Related papers (2022-03-22T17:58:31Z) - Transformer-based deep imitation learning for dual-arm robot manipulation [5.3022775496405865]
In a dual-arm manipulation setup, the increased number of state dimensions caused by the additional robot manipulator can distract the learned policy.
We address this issue using a self-attention mechanism that computes dependencies between elements in a sequential input and focuses on important elements.
A Transformer, an architecture built on self-attention, is applied to deep imitation learning to solve dual-arm manipulation tasks in the real world (a generic sketch of self-attention follows this entry).
arXiv Detail & Related papers (2021-08-01T07:42:39Z)
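As a generic illustration of the self-attention mechanism this last summary refers to, here is a single-head sketch under assumed shapes and names, not the paper's implementation:

```python
# Generic single-head self-attention over a sequence of state tokens
# (e.g. concatenated left- and right-arm joint features); illustrative only.
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, w_qkv: torch.Tensor) -> torch.Tensor:
    """x: (T, d) token sequence; w_qkv: (d, 3*d) packed Q/K/V projection."""
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)
    # Each token weighs every other token; large weights mark the
    # "important elements" the summary above refers to.
    weights = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return weights @ v

tokens = torch.randn(12, 16)              # e.g. 6 joints per arm, 16-dim features
w_qkv = torch.randn(16, 48) * 16 ** -0.5  # assumed projection weights
print(self_attention(tokens, w_qkv).shape)  # torch.Size([12, 16])
```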
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.