March in Chat: Interactive Prompting for Remote Embodied Referring
Expression
- URL: http://arxiv.org/abs/2308.10141v1
- Date: Sun, 20 Aug 2023 03:00:20 GMT
- Title: March in Chat: Interactive Prompting for Remote Embodied Referring
Expression
- Authors: Yanyuan Qiao, Yuankai Qi, Zheng Yu, Jing Liu, Qi Wu
- Abstract summary: This paper proposes a March-in-Chat (MiC) model that can talk to the LLM on the fly and plan dynamically based on a newly proposed Room-and-Object Aware Scene Perceiver (ROASP).
Our MiC model outperforms the previous state-of-the-art by large margins in SPL and RGSPL on the REVERIE benchmark.
- Score: 33.64407469423714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many Vision-and-Language Navigation (VLN) tasks have been proposed in recent
years, from room-based to object-based and from indoor to outdoor. The REVERIE
(Remote Embodied Referring Expression) task is interesting because it only provides
high-level instructions to the agent, which are closer to human commands in
practice. Nevertheless, this poses more challenges than other VLN tasks, since
it requires agents to infer a navigation plan based only on a short
instruction. Large Language Models (LLMs) show great potential in robot action
planning when given proper prompts. Still, this strategy has not been
explored under the REVERIE setting. There are several new challenges. For
example, the LLM should be environment-aware so that the navigation plan can be
adjusted based on the current visual observation. Moreover, the LLM planned
actions should be adaptable to the much larger and more complex REVERIE
environment. This paper proposes a March-in-Chat (MiC) model that can talk to
the LLM on the fly and plan dynamically based on a newly proposed
Room-and-Object Aware Scene Perceiver (ROASP). Our MiC model outperforms the
previous state-of-the-art by large margins in SPL and RGSPL on the
REVERIE benchmark.
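
To make the abstract's idea concrete, here is a minimal sketch of what an environment-aware, march-in-chat style prompting loop could look like, assuming a ROASP-like perceiver that reports the room type and visible objects at each step. The data structures, prompt wording, and agent interface are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SceneSummary:
    """What a ROASP-style perceiver might report at each viewpoint (assumed fields)."""
    room_type: str               # e.g. "kitchen"
    visible_objects: List[str]   # e.g. ["sink", "kettle"]


def build_prompt(instruction: str, scene: SceneSummary, history: List[str]) -> str:
    """Turn the high-level instruction plus the current scene into one LLM query."""
    done = "; ".join(history) if history else "none"
    return (
        f"Instruction: {instruction}\n"
        f"Current room: {scene.room_type}\n"
        f"Visible objects: {', '.join(scene.visible_objects)}\n"
        f"Completed steps: {done}\n"
        "What should the agent do next? Answer with one short navigation step."
    )


def navigate(instruction: str,
             perceive: Callable[[], SceneSummary],   # stands in for a scene perceiver
             query_llm: Callable[[str], str],        # any text-in/text-out LLM interface
             execute_step: Callable[[str], bool],    # returns True once the goal is reached
             max_steps: int = 20) -> List[str]:
    """Re-plan after every step so the LLM stays aware of the current observation."""
    history: List[str] = []
    for _ in range(max_steps):
        scene = perceive()                       # fresh observation before each query
        step = query_llm(build_prompt(instruction, scene, history))
        history.append(step)
        if execute_step(step):                   # stop as soon as the target is reached
            break
    return history
```

The point mirrored from the abstract is that the LLM is queried again after every step, so the plan can be adjusted to the latest visual observation rather than fixed up front.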
Related papers
- NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), which performs parameter-efficient in-domain training to enable self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation [38.66406497318709]
This work focuses on the tabletop manipulation task and releases a simulation benchmark, LoHoRavens, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetic, and reference.
We investigate two methods of bridging the modality gap: caption generation and a learnable interface, for incorporating explicit and implicit observation feedback into the LLM.
arXiv Detail & Related papers (2023-10-18T14:53:14Z)
- SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments [14.179677726976056]
SayNav is a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigation tasks.
SayNav achieves state-of-the-art results and even outperforms an oracle-based baseline with strong ground-truth assumptions by more than 8% in terms of success rate.
arXiv Detail & Related papers (2023-09-08T02:24:37Z)
- Ground Manipulator Primitive Tasks to Executable Actions using Large Language Models [13.827349677538352]
We propose a novel approach to ground manipulator primitive tasks to low-level robot actions using large language models (LLMs).
In this way, we enable LLMs to generate position/force set-points for hybrid control.
arXiv Detail & Related papers (2023-08-13T16:52:36Z)
- VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
- AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers [20.857692296678632]
For effective human-robot interaction, robots need to understand, plan, and execute complex, long-horizon tasks.
Recent advances in large language models have shown promise for translating natural language into robot action sequences.
We show that our approach outperforms several methods using LLMs as planners in complex task domains.
arXiv Detail & Related papers (2023-06-10T21:58:29Z)
- NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models [17.495162643127003]
We introduce NavGPT to reveal the reasoning capability of GPT models in complex embodied scenes.
NavGPT takes the textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status.
We show that NavGPT is capable of generating high-quality navigational instructions from observations and actions along a path.
arXiv Detail & Related papers (2023-05-26T14:41:06Z)
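
As a rough illustration of the inputs the NavGPT entry above lists (observation descriptions, navigation history, explorable directions), the sketch below packs them into a single reasoning prompt and reads back a discrete direction choice. The prompt wording and parsing rule are assumptions, not the paper's prompt.

```python
import re
from typing import List, Optional


def build_reasoning_prompt(observations: List[str], history: List[str],
                           directions: List[str]) -> str:
    """Combine observation text, past actions, and candidate directions into one query."""
    obs = "\n".join(f"- {o}" for o in observations)
    past = "; ".join(history) if history else "none"
    options = "\n".join(f"({i}) {d}" for i, d in enumerate(directions))
    return (
        f"Observations:\n{obs}\n"
        f"History: {past}\n"
        f"Explorable directions:\n{options}\n"
        "Describe the agent's current status, then answer with the number of the "
        "direction to explore next, e.g. (1)."
    )


def parse_choice(llm_answer: str, num_directions: int) -> Optional[int]:
    """Extract the chosen direction index from the LLM's free-form answer."""
    match = re.search(r"\((\d+)\)", llm_answer)
    if match and int(match.group(1)) < num_directions:
        return int(match.group(1))
    return None  # no valid choice found; the caller could re-prompt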
- Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents [99.17668730578586]
Pre-trained large language models (LLMs) capture procedural knowledge about the world.
The Plan, Eliminate, and Track (PET) framework translates a task description into a list of high-level sub-tasks.
The PET framework leads to a significant 15% improvement over SOTA in generalization to human goal specifications.
arXiv Detail & Related papers (2023-05-03T20:11:22Z)
- Open-vocabulary Queryable Scene Representations for Real World Planning [56.175724306976505]
Large language models (LLMs) have unlocked new capabilities of task planning from human instructions.
However, prior attempts to apply LLMs to real-world robotic tasks are limited by the lack of grounding in the surrounding scene.
We develop NLMap, an open-vocabulary and queryable scene representation to address this problem.
arXiv Detail & Related papers (2022-09-20T17:29:56Z)
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [111.33545170562337]
We investigate the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps.
We find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into low-level plans.
We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
arXiv Detail & Related papers (2022-01-18T18:59:45Z)
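
The zero-shot planners entry above describes semantically translating free-form plan steps into admissible actions. One rough way to realize that is a nearest-neighbour match in a text-embedding space, sketched below with the embedding model left abstract; the matching rule is an assumption, not the paper's exact procedure.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def translate_plan(generated_steps: List[str],
                   admissible_actions: List[str],
                   embed: Callable[[str], List[float]]) -> List[str]:
    """Snap each free-form step to the most semantically similar admissible action."""
    action_vecs = [embed(a) for a in admissible_actions]
    translated = []
    for step in generated_steps:
        step_vec = embed(step)
        scores = [cosine(step_vec, v) for v in action_vecs]
        translated.append(admissible_actions[scores.index(max(scores))])
    return translated
```

With any sentence-embedding function plugged in as `embed`, a generated step such as "grab a mug" would be mapped to whichever admissible action lies closest in the embedding space.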
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.