The Ingredients for Robotic Diffusion Transformers
- URL: http://arxiv.org/abs/2410.10088v1
- Date: Mon, 14 Oct 2024 02:02:54 GMT
- Title: The Ingredients for Robotic Diffusion Transformers
- Authors: Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Kumar Srirama, Sergey Levine,
- Abstract summary: We identify, study and improve key architectural design decisions for high-capacity diffusion transformer policies.
The resulting models can efficiently solve diverse tasks on multiple robot embodiments.
We find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language annotated ALOHA demonstration data.
- Score: 47.61690903645525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal improvements has proven surprisingly difficult, since there is no clear and well-understood process for making important design choices. In this paper, we identify, study and improve key architectural design decisions for high-capacity diffusion transformer policies. The resulting models can efficiently solve diverse tasks on multiple robot embodiments, without the excruciating pain of per-setup hyper-parameter tuning. By combining the results of our investigation with our improved model components, we are able to present a novel architecture, named \method, that significantly outperforms the state of the art in solving long-horizon ($1500+$ time-steps) dexterous tasks on a bi-manual ALOHA robot. In addition, we find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language annotated ALOHA demonstration data. We hope this work will open the door for future robot learning techniques that leverage the efficiency of generative diffusion modeling with the scalability of large scale transformer architectures. Code, robot dataset, and videos are available at: https://dit-policy.github.io
Related papers
- VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency.
Our framework outperforms state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving a 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z) - RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation [23.554917579133576]
We present Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation.
RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer.
We further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots.
arXiv Detail & Related papers (2024-10-10T12:33:46Z) - Solving Multi-Goal Robotic Tasks with Decision Transformer [0.0]
We introduce a novel adaptation of the decision transformer architecture for offline multi-goal reinforcement learning in robotics.
Our approach integrates goal-specific information into the decision transformer, allowing it to handle complex tasks in an offline setting.
arXiv Detail & Related papers (2024-10-08T20:35:30Z) - Body Transformer: Leveraging Robot Embodiment for Policy Learning [51.531793239586165]
Body Transformer (BoT) is an architecture that leverages the robot embodiment by providing an inductive bias that guides the learning process.
We represent the robot body as a graph of sensors and actuators, and rely on masked attention to pool information throughout the architecture.
The resulting architecture outperforms the vanilla transformer, as well as the classical multilayer perceptron, in terms of task completion, scaling properties, and computational efficiency.
arXiv Detail & Related papers (2024-08-12T17:31:28Z) - DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative
Diffusion Models [102.13968267347553]
We present DiffuseBot, a physics-augmented diffusion model that generates soft robot morphologies capable of excelling in a wide spectrum of tasks.
We showcase a range of simulated and fabricated robots along with their capabilities.
arXiv Detail & Related papers (2023-11-28T18:58:48Z) - Scaling Robot Learning with Semantically Imagined Experience [21.361979238427722]
Recent advances in robot learning have shown promise in enabling robots to perform manipulation tasks.
One of the key contributing factors to this progress is the scale of robot data used to train the models.
We propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing.
arXiv Detail & Related papers (2023-02-22T18:47:51Z) - RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z) - PACT: Perception-Action Causal Transformer for Autoregressive Robotics
Pre-Training [25.50131893785007]
This work introduces a paradigm for pre-training a general purpose representation that can serve as a starting point for multiple tasks on a given robot.
We present the Perception-Action Causal Transformer (PACT), a generative transformer-based architecture that aims to build representations directly from robot data in a self-supervised fashion.
We show that finetuning small task-specific networks on top of the larger pretrained model results in significantly better performance compared to training a single model from scratch for all tasks simultaneously.
arXiv Detail & Related papers (2022-09-22T16:20:17Z) - Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation.
A core challenge is to generalize the manipulation skills to objects in different locations.
We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially and a low-level controller to combine primitive dynamics and inverse kinematics control.
arXiv Detail & Related papers (2020-10-11T01:40:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.