NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large
Language Models
- URL: http://arxiv.org/abs/2305.16986v3
- Date: Thu, 19 Oct 2023 17:59:43 GMT
- Title: NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large
Language Models
- Authors: Gengze Zhou, Yicong Hong, Qi Wu
- Abstract summary: We introduce NavGPT to reveal the reasoning capability of GPT models in complex embodied scenes.
NavGPT takes the textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status.
We show that NavGPT is capable of generating high-quality navigational instructions from observations and actions along a path.
- Score: 17.495162643127003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Trained with an unprecedented scale of data, large language models (LLMs)
like ChatGPT and GPT-4 exhibit the emergence of significant reasoning abilities
from model scaling. Such a trend underscores the potential of training LLMs
with unlimited language data, advancing the development of a universal embodied
agent. In this work, we introduce NavGPT, a purely LLM-based
instruction-following navigation agent, to reveal the reasoning capability of
GPT models in complex embodied scenes by performing zero-shot sequential action
prediction for vision-and-language navigation (VLN). At each step, NavGPT takes
the textual descriptions of visual observations, navigation history, and future
explorable directions as inputs, reasons about the agent's current status, and
decides on the action that approaches the target. Through comprehensive experiments, we
demonstrate that NavGPT can explicitly perform high-level planning for navigation,
including decomposing instructions into sub-goals, integrating commonsense
knowledge relevant to the navigation task, identifying landmarks in
observed scenes, tracking navigation progress, and adapting to exceptions by
adjusting the plan. Furthermore, we show that LLMs are capable of generating
high-quality navigational instructions from observations and actions along a
path, as well as drawing an accurate top-down metric trajectory given the agent's
navigation history. Although NavGPT's zero-shot performance on R2R
still falls short of trained models, we suggest adapting multi-modality
inputs so that LLMs can serve as visual navigation agents, and applying the explicit
reasoning of LLMs to benefit learning-based models.
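The abstract frames NavGPT as a per-step prompting loop: textual scene descriptions, the navigation history, and the currently explorable directions are packed into a prompt, the LLM reasons about the agent's status, and a single action is parsed from its reply. Below is a minimal sketch of such a loop in Python, assuming a generic `query_llm` callable and simple environment accessors; all names are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of a NavGPT-style zero-shot action-prediction loop.
# All callables (query_llm, get_observations, get_directions, take_action)
# are hypothetical stand-ins, not the paper's code or a real API.
from typing import Callable, List

PROMPT_TEMPLATE = """You are a navigation agent following this instruction:
{instruction}

History of previous steps:
{history}

Current observations (one textual description per view):
{observations}

Explorable directions: {directions}

Think step by step about your progress, then answer on the last line with
exactly one direction from the list above, or STOP if the instruction is done."""


def navgpt_episode(
    instruction: str,
    get_observations: Callable[[], List[str]],  # textual scene descriptions
    get_directions: Callable[[], List[str]],    # candidate directions/viewpoints
    take_action: Callable[[str], None],         # move the agent in the simulator
    query_llm: Callable[[str], str],            # any chat LLM call, e.g. GPT-4
    max_steps: int = 20,
) -> List[str]:
    """Run one episode and return the textual navigation history."""
    history: List[str] = []
    for _ in range(max_steps):
        prompt = PROMPT_TEMPLATE.format(
            instruction=instruction,
            history="\n".join(history) or "(none)",
            observations="\n".join(get_observations()),
            directions=", ".join(get_directions()),
        )
        reply = query_llm(prompt)
        lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
        action = lines[-1] if lines else "STOP"  # naive parse of the final line
        if action.upper() == "STOP":
            break
        take_action(action)
        history.append(f"Step {len(history) + 1}: moved towards {action}")
    return history
```

In the paper, visual observations are first translated into text by vision models before being handed to the LLM; the sketch abstracts that step into `get_observations`.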
Related papers
- Guide-LLM: An Embodied LLM Agent and Text-Based Topological Map for Robotic Guidance of People with Visual Impairments [1.18749525824656]
Guide-LLM is a text-based agent designed to assist persons with visual impairments (PVI) in navigating large indoor environments.
Our approach features a novel text-based topological map that enables the LLM to plan global paths.
Simulated experiments demonstrate the system's efficacy in guiding PVI, underscoring its potential as a significant advancement in assistive technology.
arXiv Detail & Related papers (2024-10-28T01:58:21Z)
- NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models [30.685419129265252]
We bridge the divide between VLN-specialized models and LLM-based navigation paradigms.
We explore how to incorporate LLMs and navigation policy networks for effective action prediction and navigational reasoning.
arXiv Detail & Related papers (2024-07-17T07:44:26Z)
- Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [64.84996994779443]
We propose a novel Affordances-Oriented Planner for the continuous vision-language navigation (VLN) task.
Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making.
Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-07-08T12:52:46Z)
- MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains [4.941781282578696]
In the Vision-and-Language Navigation (VLN) task, the agent is required to navigate to a destination following a natural language instruction.
While learning-based approaches have been a major solution to the task, they suffer from high training costs and lack of interpretability.
Recently, Large Language Models (LLMs) have emerged as a promising tool for VLN due to their strong generalization capabilities.
arXiv Detail & Related papers (2024-05-17T08:33:27Z)
- TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation [11.591176410027224]
This paper presents a Vision-Language Navigation (VLN) agent based on Large Language Models (LLMs).
We propose the Thinking, Interacting, and Action framework to compensate for the shortcomings of LLMs in environmental perception.
Our approach also outperformed some supervised learning-based methods, highlighting its efficacy in zero-shot navigation.
arXiv Detail & Related papers (2024-03-13T05:22:39Z)
- NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we perform parameter-efficient in-domain training to enable self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
- $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models [89.64729024399634]
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions.
Normally, the instructions have complex grammatical structures and often contain various action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
arXiv Detail & Related papers (2023-08-15T19:01:19Z)
- VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
- Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation [58.3480730643517]
We present LGX, a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON).
Our approach makes use of Large Language Models (LLMs) for this task.
We achieve state-of-the-art zero-shot object navigation results on RoboTHOR with a success rate (SR) improvement of over 27% over the current baseline.
arXiv Detail & Related papers (2023-03-06T20:19:19Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
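The LM-Nav entry above describes a navigation system assembled entirely from frozen, pre-trained components. As a rough illustration of that composition, the sketch below greedily grounds landmarks extracted by a language model to nodes of a topological map via CLIP-style similarity and stitches ViNG-style plans between them; the actual system performs a joint graph search, and every callable here (`extract_landmarks`, `clip_score`, `ving_plan`) is a hypothetical stand-in rather than the authors' code.

```python
# Hedged sketch of an LM-Nav-style pipeline built only from pre-trained parts.
# extract_landmarks ~ GPT-3, clip_score ~ CLIP, ving_plan ~ ViNG; all three are
# hypothetical stand-ins, and the grounding here is a greedy simplification.
from typing import Any, Callable, Dict, List


def lm_nav_route(
    instruction: str,
    graph_nodes: List[str],                         # node ids of a topological map
    node_images: Dict[str, Any],                    # one image observation per node
    extract_landmarks: Callable[[str], List[str]],  # instruction -> ordered landmarks
    clip_score: Callable[[Any, str], float],        # image/text similarity score
    ving_plan: Callable[[str, str], List[str]],     # node path between two nodes
) -> List[str]:
    """Ground each landmark to its best-matching node, then chain plans between them."""
    landmarks = extract_landmarks(instruction)
    # Greedy grounding: pick the node whose image scores highest for each landmark.
    grounded = [
        max(graph_nodes, key=lambda n: clip_score(node_images[n], lm))
        for lm in landmarks
    ]
    # Stitch the route by planning between consecutive grounded nodes.
    route: List[str] = []
    for start, goal in zip(grounded, grounded[1:]):
        route.extend(ving_plan(start, goal))
    return route
```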
This list is automatically generated from the titles and abstracts of the papers in this site.