Open-Ended Instructable Embodied Agents with Memory-Augmented Large
Language Models
- URL: http://arxiv.org/abs/2310.15127v2
- Date: Mon, 20 Nov 2023 18:51:29 GMT
- Title: Open-Ended Instructable Embodied Agents with Memory-Augmented Large
Language Models
- Authors: Gabriel Sarch, Yue Wu, Michael J. Tarr, Katerina Fragkiadaki
- Abstract summary: We introduce HELPER, an embodied agent equipped with an external memory of language-program pairs.
Relevant memories are retrieved based on the current dialogue, instruction, correction, or VLM description.
HELPER sets a new state-of-the-art in the TEACh benchmark in both Execution from Dialog History (EDH) and Trajectory from Dialogue (TfD).
- Score: 19.594361652336996
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained and frozen large language models (LLMs) can effectively map
simple scene rearrangement instructions to programs over a robot's visuomotor
functions through appropriate few-shot example prompting. To parse open-domain
natural language and adapt to a user's idiosyncratic procedures, not known
during prompt engineering time, fixed prompts fall short. In this paper, we
introduce HELPER, an embodied agent equipped with an external memory of
language-program pairs that parses free-form human-robot dialogue into action
programs through retrieval-augmented LLM prompting: relevant memories are
retrieved based on the current dialogue, instruction, correction, or VLM
description, and used as in-context prompt examples for LLM querying. The
memory is expanded during deployment to include pairs of user's language and
action plans, to assist future inferences and personalize them to the user's
language and routines. HELPER sets a new state-of-the-art in the TEACh
benchmark in both Execution from Dialog History (EDH) and Trajectory from
Dialogue (TfD), with a 1.7x improvement over the previous state-of-the-art for
TfD. Our models, code, and video results can be found on our project website:
https://helper-agent-llm.github.io.
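The retrieval-augmented prompting loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under assumed names, not the authors' released code: the embed() stand-in, the LanguageProgramMemory class, and the example visuomotor programs are all hypothetical.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in text embedding; a real sentence encoder would be used in practice."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

class LanguageProgramMemory:
    """External memory of (language, program) pairs, queried by similarity."""
    def __init__(self):
        self.keys = []    # embedded language snippets
        self.values = []  # (language, program) pairs

    def add(self, language: str, program: str) -> None:
        self.keys.append(embed(language))
        self.values.append((language, program))

    def retrieve(self, query: str, k: int = 3):
        q = embed(query)
        sims = [q @ key / (np.linalg.norm(q) * np.linalg.norm(key))
                for key in self.keys]
        top = np.argsort(sims)[::-1][:k]
        return [self.values[i] for i in top]

def build_prompt(memory: LanguageProgramMemory, request: str) -> str:
    """Use retrieved pairs as in-context examples for the LLM query."""
    shots = "\n\n".join(f"User: {lang}\nProgram: {prog}"
                        for lang, prog in memory.retrieve(request))
    return f"{shots}\n\nUser: {request}\nProgram:"

memory = LanguageProgramMemory()
memory.add("put the mug in the sink",
           "goto('mug'); pickup('mug'); goto('sink'); place('mug', 'sink')")
memory.add("make me a coffee",
           "goto('coffee_machine'); toggle_on('coffee_machine')")
# During deployment, each confirmed (language, program) pair is added back to
# memory, personalizing future retrievals to the user's language and routines.
print(build_prompt(memory, "place the cup in the sink"))
```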
Related papers
- HELPER-X: A Unified Instructable Embodied Agent to Tackle Four Interactive Vision-Language Domains with Memory-Augmented Language Models [13.963676467274109]
We extend the capabilities of HELPER by expanding its memory with a wider array of examples and prompts.
This simple expansion of HELPER into a shared memory enables the agent to work across domains: executing plans from dialogue, natural language instruction following, active question asking, and commonsense room reorganization.
We evaluate the agent on four diverse interactive visual-language embodied agent benchmarks: ALFRED, TEACh, DialFRED, and the Tidy Task.
arXiv Detail & Related papers (2024-04-29T19:12:42Z) - Apollonion: Profile-centric Dialog Agent [9.657755354649048]
We propose a framework for dialog agents to incorporate user profiling (initialization and update): the user's queries and responses are analyzed and organized into a structured user profile.
We propose a series of evaluation protocols for personalization: to what extent the responses are personalized to different users.
arXiv Detail & Related papers (2024-04-10T03:32:41Z) - Interpreting User Requests in the Context of Natural Language Standing
Instructions [89.12540932734476]
We develop NLSI, a language-to-program dataset consisting of over 2.4K dialogues spanning 17 domains.
A key challenge in NLSI is to identify which subset of the standing instructions is applicable to a given dialogue.
arXiv Detail & Related papers (2023-11-16T11:19:26Z) - Instruction-Following Speech Recognition [21.591086644665197]
We introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions.
Remarkably, our model, trained from scratch on LibriSpeech, interprets and executes simple instructions without requiring large language models or pre-trained speech modules.
arXiv Detail & Related papers (2023-09-18T14:59:10Z) - Recursively Summarizing Enables Long-Term Dialogue Memory in Large
Language Models [75.98775135321355]
Given a long conversation, large language models (LLMs) fail to recall past information and tend to generate inconsistent responses.
We propose to generate summaries/memory using large language models (LLMs) to enhance their long-term memory ability.
arXiv Detail & Related papers (2023-08-29T04:59:53Z) - AmadeusGPT: a natural language interface for interactive animal
behavioral analysis [65.55906175884748]
We introduce AmadeusGPT: a natural language interface that turns natural language descriptions of behaviors into machine-executable code.
We show that we can produce state-of-the-art performance on the MABe 2022 behavior challenge tasks.
AmadeusGPT presents a novel way to merge deep biological knowledge, large-language models, and core computer vision modules into a more naturally intelligent system.
arXiv Detail & Related papers (2023-07-10T19:15:17Z) - GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z) - In-Context Learning for Few-Shot Dialogue State Tracking [55.91832381893181]
We propose an in-context (IC) learning framework for few-shot dialogue state tracking (DST).
A large pre-trained language model (LM) takes a test instance and a few annotated examples as input, and directly decodes the dialogue states without any parameter updates.
This makes the LM more flexible and scalable compared to prior few-shot DST work when adapting to new domains and scenarios.
arXiv Detail & Related papers (2022-03-16T11:58:24Z) - Language Models as Zero-Shot Planners: Extracting Actionable Knowledge
for Embodied Agents [111.33545170562337]
We investigate the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps.
We find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into low-level plans.
We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
arXiv Detail & Related papers (2022-01-18T18:59:45Z)
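The semantic translation step named in the last entry, mapping each free-form step an LLM proposes to the closest admissible action, can be sketched as follows. This is a hedged illustration: the action list and the embed() stand-in are assumptions, not the paper's implementation (which uses a pre-trained sentence encoder).

```python
import numpy as np

# Hypothetical set of actions the environment actually accepts.
ADMISSIBLE_ACTIONS = ["walk to kitchen", "open fridge", "grab milk", "close fridge"]

def embed(text: str) -> np.ndarray:
    """Stand-in text embedding; a real sentence encoder would be used in practice."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

def translate(step: str) -> str:
    """Map a free-form plan step to the most similar admissible action."""
    q = embed(step)
    def cosine(action: str) -> float:
        v = embed(action)
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    return max(ADMISSIBLE_ACTIONS, key=cosine)

# Free-form steps proposed by an LLM are grounded one at a time:
for step in ["go to the kitchen", "take the milk from the fridge"]:
    print(f"{step!r} -> {translate(step)!r}")
```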
This list is automatically generated from the titles and abstracts of the papers on this site.