Prompter: Utilizing Large Language Model Prompting for a Data Efficient Embodied Instruction Following
- URL: http://arxiv.org/abs/2211.03267v2
- Date: Tue, 12 Mar 2024 09:01:54 GMT
- Title: Prompter: Utilizing Large Language Model Prompting for a Data Efficient Embodied Instruction Following
- Authors: Yuki Inoue and Hiroki Ohashi
- Abstract summary: Embodied Instruction Following studies how autonomous mobile manipulation robots should be controlled to accomplish long-horizon tasks.
We show that embedding the physical constraints of the deployed robots into the module design is highly effective.
Our design also allows the same modular system to work across robots of different configurations with minimal modifications.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embodied Instruction Following (EIF) studies how autonomous mobile
manipulation robots should be controlled to accomplish long-horizon tasks
described by natural language instructions. While much research on EIF is
conducted in simulators, the ultimate goal of the field is to deploy the agents
in real life. This is one of the reasons why recent methods have moved away
from end-to-end training toward modular approaches, which do not require
costly expert operation data. However, since modular design is still new to
EIF, the search for modules effective in the EIF task is far from complete.
In this paper, we propose to extend the modular
design using knowledge obtained from two external sources. First, we show that
embedding the physical constraints of the deployed robots into the module
design is highly effective. Our design also allows the same modular system to
work across robots of different configurations with minimal modifications.
Second, we show that the landmark-based object search, previously implemented
by a trained model requiring a dedicated set of data, can be replaced by an
implementation that prompts pretrained large language models for
landmark-object relationships, eliminating the need for collecting dedicated
training data. Our proposed Prompter achieves 41.53% and 45.32% on the ALFRED
benchmark with high-level instructions only and step-by-step instructions,
respectively, significantly outperforming the previous state of the art by
5.46% and 9.91%.
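The landmark-based object search described above can be sketched as follows. This is a minimal illustration of the general idea, not the paper's implementation: the prompt template, the `rank_landmarks` helper, and the toy co-occurrence scorer are all assumptions standing in for a pretrained language model's likelihood scores.

```python
# Sketch: ranking candidate landmarks for an object search by scoring
# natural-language prompts. The template and scorer below are illustrative
# assumptions, not the exact prompts or model used by Prompter.

def build_prompt(target: str, landmark: str) -> str:
    """A fill-in style sentence relating a target object to a landmark."""
    return f"A likely place to find a {target} is near the {landmark}."

def rank_landmarks(target, landmarks, score_fn):
    """Rank candidate landmarks by score_fn(prompt), where score_fn is a
    stand-in for a pretrained LM's (pseudo-)log-likelihood of the sentence.
    No dedicated training data is needed; only the frozen LM's scores."""
    scored = [(lm, score_fn(build_prompt(target, lm))) for lm in landmarks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy scorer standing in for an LLM: a tiny hand-written co-occurrence table.
CO_OCCURRENCE = {
    ("mug", "coffee machine"): 0.9,
    ("mug", "cabinet"): 0.7,
    ("mug", "sofa"): 0.1,
}

def toy_score(prompt: str) -> float:
    """Return the table score for the (target, landmark) pair in the prompt."""
    for (tgt, lm), s in CO_OCCURRENCE.items():
        if tgt in prompt and lm in prompt:
            return s
    return 0.0

ranking = rank_landmarks("mug", ["sofa", "coffee machine", "cabinet"], toy_score)
print(ranking[0][0])  # most promising landmark to navigate toward first
```

In the paper's setting, `score_fn` would query a pretrained language model, so the landmark-object relationship model requires no task-specific training data.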
Related papers
- Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning [15.03025428687218]
Object State-Sensitive Agent (OSSA) is a task-planning agent empowered by pre-trained neural networks.
We propose two methods for OSSA: (i) a modular model consisting of a pre-trained vision processing module and a large language model (LLM), and (ii) a monolithic model consisting only of a vision-language model (VLM).
Our results show that both methods can be used for object state-sensitive tasks, but the monolithic approach outperforms the modular approach.
arXiv Detail & Related papers (2024-06-14T12:52:42Z) - Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities [59.02391344178202]
Vision foundation models (VFMs) serve as potent building blocks for a wide range of AI applications.
The scarcity of comprehensive training data, the need for multi-sensor integration, and the diverse task-specific architectures pose significant obstacles to the development of VFMs.
This paper delves into the critical challenge of forging VFMs tailored specifically for autonomous driving, while also outlining future directions.
arXiv Detail & Related papers (2024-01-16T01:57:24Z) - Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning [50.47568731994238]
A key method for creating Artificial Intelligence (AI) agents is Reinforcement Learning (RL).
This paper presents a general framework for integrating and learning structured reasoning into AI agents' policies.
arXiv Detail & Related papers (2023-12-22T17:57:57Z) - ModuleFormer: Modularity Emerges from Mixture-of-Experts [60.6148988099284]
This paper proposes a new neural network architecture, ModuleFormer, to improve the efficiency and flexibility of large language models.
Unlike the previous SMoE-based modular language model, ModuleFormer can induce modularity from uncurated data.
arXiv Detail & Related papers (2023-06-07T17:59:57Z) - Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
arXiv Detail & Related papers (2023-05-18T17:59:49Z) - Modular Deep Learning [120.36599591042908]
Transfer learning has recently become the dominant paradigm of machine learning.
It remains unclear how to develop models that specialise towards multiple tasks without incurring negative interference.
Modular deep learning has emerged as a promising solution to these challenges.
arXiv Detail & Related papers (2023-02-22T18:11:25Z) - Modular Framework for Visuomotor Language Grounding [57.93906820466519]
Natural language instruction following tasks serve as a valuable test-bed for grounded language and robotics research.
We propose the structuring of language, acting, and visual tasks into separate modules that can be trained independently.
arXiv Detail & Related papers (2021-09-05T20:11:53Z) - Self-training Improves Pre-training for Few-shot Learning in Task-oriented Dialog Systems [47.937191088981436]
Large-scale pre-trained language models have shown promising results for few-shot learning in task-oriented dialog (ToD) systems.
We propose a self-training approach that iteratively labels the most confident unlabeled data to train a stronger Student model.
We conduct experiments and present analyses on four downstream tasks in ToD, including intent classification, dialog state tracking, dialog act prediction, and response selection.
arXiv Detail & Related papers (2021-08-28T07:22:06Z) - A Data Efficient End-To-End Spoken Language Understanding Architecture [22.823732899634518]
We introduce a data efficient system which is trained end-to-end, with no additional, pre-trained external module.
The proposed model achieves a reasonable size and competitive results with respect to state-of-the-art while using a small training dataset.
arXiv Detail & Related papers (2020-02-14T10:24:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.