LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
- URL: http://arxiv.org/abs/2406.20095v1
- Date: Fri, 28 Jun 2024 17:59:12 GMT
- Title: LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
- Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo,
- Abstract summary: Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains.
We propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations.
- Score: 56.505551117094534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets based on a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.
Related papers
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments yield strong performance, demonstrating that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z) - MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual
Prompting [106.53784213239479]
We present MOKA (Marking Open-vocabulary Keypoint Affordances), an approach that employs vision language models to solve robotic manipulation tasks.
At the heart of our approach is a compact point-based representation of affordance and motion that bridges the VLM's predictions on RGB images and the robot's motions in the physical world.
We evaluate and analyze MOKA's performance on a variety of manipulation tasks specified by free-form language descriptions.
arXiv Detail & Related papers (2024-03-05T18:08:45Z) - PREDILECT: Preferences Delineated with Zero-Shot Language-based
Reasoning in Reinforcement Learning [2.7387720378113554]
Preference-based reinforcement learning (RL) has emerged as a new field in robot learning.
We use the zero-shot capabilities of a large language model (LLM) to reason from the text provided by humans.
In both a simulated scenario and a user study, we reveal the effectiveness of our work by analyzing the feedback and its implications.
arXiv Detail & Related papers (2024-02-23T16:30:05Z) - Large Language Models for Robotics: Opportunities, Challenges, and
Perspectives [46.57277568357048]
Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains.
For embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception.
We propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions.
arXiv Detail & Related papers (2024-01-09T03:22:16Z) - Interactive Planning Using Large Language Models for Partially
Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - Vision-Language Foundation Models as Effective Robot Imitators [48.73027330407576]
We derive a vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLMs, OpenFlamingo.
By exceeding the state-of-the-art performance with a large margin on the tested benchmark, we show RoboFlamingo can be an effective and competitive alternative to adapt VLMs to robot control.
arXiv Detail & Related papers (2023-11-02T16:34:33Z) - Chat with the Environment: Interactive Multimodal Perception Using Large
Language Models [19.623070762485494]
Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning.
Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment.
arXiv Detail & Related papers (2023-03-14T23:01:27Z) - Robotic Skill Acquisition via Instruction Augmentation with
Vision-Language Models [70.82705830137708]
We introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL)
We utilize semi-language labels leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data.
DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
arXiv Detail & Related papers (2022-11-21T18:56:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.