Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model
- URL: http://arxiv.org/abs/2305.11176v3
- Date: Wed, 24 May 2023 04:17:34 GMT
- Title: Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model
- Authors: Siyuan Huang, Zhengkai Jiang, Hao Dong, Yu Qiao, Peng Gao, Hongsheng Li
- Abstract summary: Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
- Score: 63.66204449776262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation models have made significant strides in various applications,
including text-to-image generation, panoptic segmentation, and natural language
processing. This paper presents Instruct2Act, a framework that utilizes Large
Language Models to map multi-modal instructions to sequential actions for
robotic manipulation tasks. Specifically, Instruct2Act employs an LLM to
generate Python programs that constitute a comprehensive perception, planning,
and action loop for robotic tasks. In the perception section, pre-defined APIs
are used to access multiple foundation models where the Segment Anything Model
(SAM) accurately locates candidate objects, and CLIP classifies them. In this
way, the framework leverages the expertise of foundation models and robotic
abilities to convert complex high-level instructions into precise policy codes.
Our approach is adjustable and flexible in accommodating various instruction
modalities and input types and catering to specific task demands. We validated
the practicality and efficiency of our approach by assessing it on robotic
tasks in different scenarios within tabletop manipulation domains. Furthermore,
our zero-shot method outperformed many state-of-the-art learning-based policies
in several tasks. The code for our proposed approach is available at
https://github.com/OpenGVLab/Instruct2Act, serving as a robust benchmark for
high-level robotic instruction tasks with assorted modality inputs.
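To make the idea concrete, below is a minimal sketch, not the repository's actual API, of the kind of Python policy code Instruct2Act prompts an LLM to generate: a perception-planning-action loop built from pre-defined wrappers around SAM (class-agnostic object proposals) and CLIP (open-vocabulary labels). All function and class names here (segment_objects, classify_crops, pick_and_place, Candidate) are hypothetical placeholders with stubbed bodies so the example runs standalone.

```python
# Illustrative sketch only: hypothetical wrappers stand in for the framework's
# pre-defined perception and action APIs.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    label: str = ""


def segment_objects(image) -> List[Candidate]:
    """Stand-in for a SAM wrapper: return class-agnostic object proposals."""
    return [Candidate((10, 10, 60, 60)), Candidate((120, 40, 180, 100))]


def classify_crops(image, candidates: List[Candidate], vocabulary: List[str]) -> List[Candidate]:
    """Stand-in for a CLIP wrapper: assign each proposal a label from `vocabulary`."""
    for cand, label in zip(candidates, vocabulary):
        cand.label = label  # a real wrapper would score image crops against text prompts
    return candidates


def pick_and_place(source: Candidate, target: Candidate) -> None:
    """Stand-in for a low-level motion primitive exposed to the generated code."""
    print(f"pick {source.label} at {source.bbox} -> place on {target.label} at {target.bbox}")


def generated_policy(image) -> None:
    """What the LLM might emit for 'put the red block into the green bowl'."""
    candidates = segment_objects(image)                                 # perception (SAM)
    labeled = classify_crops(image, candidates,
                             ["red block", "green bowl"])               # perception (CLIP)
    source = next(c for c in labeled if c.label == "red block")         # planning
    target = next(c for c in labeled if c.label == "green bowl")
    pick_and_place(source, target)                                      # action


if __name__ == "__main__":
    generated_policy(image=None)  # a real call would pass a camera frame
```

The design choice this illustrates is that the LLM never touches pixels or torques directly; it only composes calls to trusted perception and control primitives, which is what lets the zero-shot generated code transfer across instruction modalities.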
Related papers
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments yield strong performance, demonstrating that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
Such tasks pose a major challenge to a robot's ability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
- VIMA: General Robot Manipulation with Multimodal Prompts [82.01214865117637]
We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts.
We develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks.
We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively.
arXiv Detail & Related papers (2022-10-06T17:50:11Z)
- Grounding Language with Visual Affordances over Unstructured Data [26.92329260907805]
We propose a novel approach to efficiently learn language-conditioned robot skills from unstructured, offline and reset-free data.
We exploit a self-supervised visuo-lingual affordance model, which requires language annotations for as little as 1% of the total data.
We find that our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches.
arXiv Detail & Related papers (2022-10-04T21:16:48Z)
- ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [68.57918965060787]
Large language models (LLMs) can be used to score potential next actions during task planning.
We present a programmatic LLM prompt structure that enables plan generation that is functional across situated environments (see the sketch after this list).
arXiv Detail & Related papers (2022-09-22T20:29:49Z)
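For the ProgPrompt entry above, a minimal sketch of what such a programmatic prompt might look like is shown below. This is an assumed structure, not ProgPrompt's verbatim prompt: action primitives appear as imports, scene objects as a Python list, one worked example plan as a function body, and the new task as an open function header for the LLM to complete. All identifiers are illustrative placeholders.

```python
# Hypothetical ProgPrompt-style prompt string; identifiers are placeholders.
PROMPT = '''
from actions import grab, put_in, open_obj, close_obj, switch_on

objects = ["salmon", "microwave", "fridge", "plate"]

def microwave_salmon():
    # worked example plan provided in the prompt
    open_obj("microwave")
    grab("salmon")
    put_in("salmon", "microwave")
    close_obj("microwave")
    switch_on("microwave")

def put_salmon_in_fridge():
    # the LLM continues generating the plan from here
'''

if __name__ == "__main__":
    # In practice this string would be sent to an LLM completion endpoint and the
    # returned code executed step by step in the situated environment.
    print(PROMPT)
```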