RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic
Control
- URL: http://arxiv.org/abs/2307.15818v1
- Date: Fri, 28 Jul 2023 21:18:02 GMT
- Title: RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic
Control
- Authors: Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi
Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey,
Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana
Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu,
Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov,
Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao
Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista
Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet,
Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke,
Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia,
Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich
- Abstract summary: We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
- Score: 140.48218261864153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study how vision-language models trained on Internet-scale data can be
incorporated directly into end-to-end robotic control to boost generalization
and enable emergent semantic reasoning. Our goal is to enable a single
end-to-end trained model to both learn to map robot observations to actions and
enjoy the benefits of large-scale pretraining on language and vision-language
data from the web. To this end, we propose to co-fine-tune state-of-the-art
vision-language models on both robotic trajectory data and Internet-scale
vision-language tasks, such as visual question answering. In contrast to other
approaches, we propose a simple, general recipe to achieve this goal: in order
to fit both natural language responses and robotic actions into the same
format, we express the actions as text tokens and incorporate them directly
into the training set of the model in the same way as natural language tokens.
We refer to this category of models as vision-language-action models (VLA) and
instantiate an example of such a model, which we call RT-2. Our extensive
evaluation (6k evaluation trials) shows that our approach leads to performant
robotic policies and enables RT-2 to obtain a range of emergent capabilities
from Internet-scale training. This includes significantly improved
generalization to novel objects, the ability to interpret commands not present
in the robot training data (such as placing an object onto a particular number
or icon), and the ability to perform rudimentary reasoning in response to user
commands (such as picking up the smallest or largest object, or the one closest
to another object). We further show that incorporating chain of thought
reasoning allows RT-2 to perform multi-stage semantic reasoning, for example
figuring out which object to pick up for use as an improvised hammer (a rock),
or which type of drink is best suited for someone who is tired (an energy
drink).
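The core of the recipe described above, representing continuous robot actions as text tokens so that a vision-language model can emit them like ordinary language, can be illustrated with a short sketch. The snippet below is an illustrative approximation, not the paper's implementation: the 7-dimensional action layout, the per-dimension value ranges, the choice of 256 uniform bins, and the use of plain integer strings as "action tokens" are all assumptions made for the example.

```python
import numpy as np

# Illustrative sketch of the action-as-text-tokens idea from the abstract.
# Assumptions (not taken from the paper text): a 7-D end-effector action
# (position deltas, rotation deltas, gripper), hypothetical value ranges,
# 256 uniform bins per dimension, and integer strings as the action tokens.
NUM_BINS = 256
ACTION_LOW = np.array([-0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])   # hypothetical limits
ACTION_HIGH = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 1.0])

def action_to_tokens(action: np.ndarray) -> str:
    """Discretize a continuous action into bin indices and render them as text."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    normalized = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    bins = np.round(normalized * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def tokens_to_action(token_str: str) -> np.ndarray:
    """Invert the mapping: parse a generated token string back into an action."""
    bins = np.array([int(t) for t in token_str.split()], dtype=float)
    normalized = bins / (NUM_BINS - 1)
    return ACTION_LOW + normalized * (ACTION_HIGH - ACTION_LOW)

# A robot trajectory step can then be serialized like a VQA example, e.g.
#   prompt: image + "pick up the object closest to the cup"
#   target: "153 114 127 5 25 153 255"
example_action = np.array([0.02, -0.01, 0.0, -0.48, -0.4, 0.1, 1.0])
tokens = action_to_tokens(example_action)
recovered = tokens_to_action(tokens)
```

Because the targets are just token sequences, robot examples of this form can be mixed into the same training batches as web vision-language data such as visual question answering, which is what the abstract refers to as co-fine-tuning.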
Related papers
- Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA).
LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z)
- KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data [45.25288643161976]
We propose Keypoint Affordance Learning from Imagined Environments (KALIE) for robotic control in a scalable manner.
Instead of directly producing motor commands, KALIE controls the robot by predicting point-based affordance representations.
We demonstrate that KALIE can learn to robustly solve new manipulation tasks with unseen objects given only 50 example data points.
arXiv Detail & Related papers (2024-09-21T08:45:16Z)
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments demonstrate that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- LIV: Language-Image Representations and Rewards for Robotic Control [37.12560985663822]
We present a unified objective for vision-language representation and reward learning from action-free videos with text annotations.
We use LIV to pre-train the first control-centric vision-language representation from large human video datasets such as EpicKitchen.
Our results validate the advantages of joint vision-language representation and reward learning within the unified, compact LIV framework.
arXiv Detail & Related papers (2023-06-01T17:52:23Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Language-Driven Representation Learning for Robotics [115.93273609767145]
Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks.
We introduce a framework for language-driven representation learning from human videos and captions.
We find that Voltron's language-driven learning outperforms the prior state-of-the-art, especially on targeted problems requiring higher-level control.
arXiv Detail & Related papers (2023-02-24T17:29:31Z)
- Learning Flexible Translation between Robot Actions and Language Descriptions [16.538887534958555]
We propose paired gated autoencoders (PGAE) for flexible translation between robot actions and language descriptions.
We train our model in an end-to-end fashion by pairing each action with appropriate descriptions that contain a signal informing about the translation direction.
With the option to use a pretrained language model as the language encoder, our model has the potential to recognise unseen natural language input.
arXiv Detail & Related papers (2022-07-15T12:37:05Z)
- Language Model-Based Paired Variational Autoencoders for Robotic Language Learning [18.851256771007748]
Similar to human infants, artificial agents can learn language while interacting with their environment.
We present a neural model that bidirectionally binds robot actions and their language descriptions in a simple object manipulation scenario.
Next, we introduce PVAE-BERT, which equips the model with a pretrained large-scale language model.
arXiv Detail & Related papers (2022-01-17T10:05:26Z)