Related papers: LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

URL: http://arxiv.org/abs/2406.20095v1
Date: Fri, 28 Jun 2024 17:59:12 GMT
Title: LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo
Abstract summary: Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains. We propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations.
Score: 56.505551117094534
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets, based on a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.

Related papers

Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA). LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z)
VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model [4.557035895252272]
Vision Language Models (VLMs) have been adopted in robotics for their capability in common sense reasoning and generalizability. In this work, we explore using VLM to interpret human demonstration videos and generate robot task planning. We named it SeeDo because it enables the VLM to "see" human demonstrations and explain the corresponding plans to the robot for it to "do".
arXiv Detail & Related papers (2024-10-11T13:17:52Z)
KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data [45.25288643161976]
We propose Keypoint Affordance Learning from Imagined Environments (KALIE) for robotic control in a scalable manner. Instead of directly producing motor commands, KALIE controls the robot by predicting point-based affordance representations. We demonstrate that KALIE can learn to robustly solve new manipulation tasks with unseen objects given only 50 example data points.
arXiv Detail & Related papers (2024-09-21T08:45:16Z)
CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving [1.727597257312416]
CoVLA (Comprehensive Vision-Language-Action) dataset comprises real-world driving videos spanning more than 80 hours. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems.
arXiv Detail & Related papers (2024-08-19T09:53:49Z)
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments. We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model. Experiments yield strong performance, demonstrating that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z)
PREDILECT: Preferences Delineated with Zero-Shot Language-based Reasoning in Reinforcement Learning [2.7387720378113554]
Preference-based reinforcement learning (RL) has emerged as a new field in robot learning. We use the zero-shot capabilities of a large language model (LLM) to reason from the text provided by humans. In both a simulated scenario and a user study, we reveal the effectiveness of our work by analyzing the feedback and its implications.
arXiv Detail & Related papers (2024-02-23T16:30:05Z)
PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs [140.14239499047977]
Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding. We propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT). We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities.
arXiv Detail & Related papers (2024-02-12T18:33:47Z)
Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks. We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
Vision-Language Foundation Models as Effective Robot Imitators [48.73027330407576]
We derive a vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLMs, OpenFlamingo. By exceeding the state-of-the-art performance with a large margin on the tested benchmark, we show RoboFlamingo can be an effective and competitive alternative to adapt VLMs to robot control.
arXiv Detail & Related papers (2023-11-02T16:34:33Z)
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control. Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models [70.82705830137708]
We introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL). We utilize semi-supervised language labels leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data. DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
arXiv Detail & Related papers (2022-11-21T18:56:00Z)
Modular Framework for Visuomotor Language Grounding [57.93906820466519]
Natural language instruction following tasks serve as a valuable test-bed for grounded language and robotics research. We propose the structuring of language, acting, and visual tasks into separate modules that can be trained independently.
arXiv Detail & Related papers (2021-09-05T20:11:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.