Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions
- URL: http://arxiv.org/abs/2408.04168v3
- Date: Thu, 17 Oct 2024 06:43:13 GMT
- Title: Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions
- Authors: Qingbin Zeng, Qinglong Yang, Shunan Dong, Heming Du, Liang Zheng, Fengli Xu, Yong Li
- Abstract summary: This paper introduces a novel agentic workflow characterized by its abilities to perceive, reflect, and plan.
We find that LLaVA-7B can be fine-tuned to perceive the direction and distance of landmarks with sufficient accuracy for city navigation.
- Score: 19.03156236107806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper considers a scenario in city navigation: an AI agent is given language descriptions of the goal location with respect to some well-known landmarks; by observing only the surrounding scene, including recognizing landmarks and road network connections, the agent has to make decisions to navigate to the goal location without instructions. The problem is very challenging because it requires the agent to establish its own position and acquire a spatial representation of a complex urban environment in which landmarks are often invisible. In the absence of navigation instructions, such abilities are vital for the agent to make high-quality decisions in long-range city navigation. Given the emergent reasoning ability of large language models (LLMs), a tempting baseline is to prompt LLMs to "react" to each observation and make decisions accordingly. However, this baseline performs poorly: the agent often revisits the same locations and makes short-sighted, inconsistent decisions. To address these issues, this paper introduces a novel agentic workflow characterized by its abilities to perceive, reflect, and plan. Specifically, we find that LLaVA-7B can be fine-tuned to perceive the direction and distance of landmarks with sufficient accuracy for city navigation. Moreover, reflection is achieved through a memory mechanism in which past experiences are stored and can be retrieved together with the current perception for effective decision-making. Planning uses the reflection results to produce long-term plans, which avoid short-sighted decisions in long-range navigation. We show that the designed workflow significantly improves the navigation ability of the LLM agent compared with state-of-the-art baselines.
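The perceive-reflect-plan workflow described in the abstract can be illustrated with a short agent-loop sketch. This is a minimal illustration under assumptions, not the authors' released code: the environment interface, the `perceive` callable (standing in for the fine-tuned LLaVA-7B landmark estimator), the `plan` callable (standing in for the LLM planner), and the memory-retrieval heuristic are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class EpisodicMemory:
    """Stores past (perception, action, outcome) records for reflection."""
    records: list = field(default_factory=list)

    def add(self, perception: dict, action: str, outcome: dict) -> None:
        self.records.append(
            {"perception": perception, "action": action, "outcome": outcome}
        )

    def retrieve(self, perception: dict, k: int = 3) -> list:
        # Placeholder retrieval: return the k most recent experiences.
        # A real implementation would rank records by similarity to `perception`.
        return self.records[-k:]


def navigate(env: Any,
             goal_description: str,
             perceive: Callable[[Any], dict],
             plan: Callable[[dict, list, str], dict],
             max_steps: int = 100) -> bool:
    """Run a perceive -> reflect -> plan -> act loop until the goal is reached.

    `perceive` stands in for a fine-tuned VLM estimating landmark direction and
    distance; `plan` stands in for an LLM that turns the reflection into a
    long-term route and the next action. Both are supplied by the caller.
    """
    memory = EpisodicMemory()
    for _ in range(max_steps):
        image = env.observe()                   # current street-view scene
        perception = perceive(image)            # e.g. {"clock tower": ("NE", "far")}
        recalled = memory.retrieve(perception)  # reflection: recall past experiences
        decision = plan(perception, recalled, goal_description)
        outcome = env.step(decision["next_action"])
        memory.add(perception, decision["next_action"], outcome)
        if outcome.get("reached_goal"):
            return True
    return False
```

Keeping retrieval and planning as separate steps mirrors the abstract's claim that reflection over stored experiences is what prevents the short-sighted, repetitive behavior of a purely reactive baseline.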
Related papers
- CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory [39.76840258489023]
Aerial vision-and-language navigation (VLN) requires drones to interpret natural language instructions and navigate complex urban environments. We propose CityNavAgent, a large language model (LLM)-empowered agent that significantly reduces the navigation complexity for urban aerial VLN.
arXiv Detail & Related papers (2025-05-08T20:01:35Z) - Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation [35.71602601385161]
We present a novel vision-language model (VLM)-based navigation framework.
Our approach enhances spatial reasoning and decision-making in long-horizon tasks.
Experimental results demonstrate that the proposed method surpasses previous state-of-the-art approaches in object navigation tasks.
arXiv Detail & Related papers (2025-02-20T04:41:40Z) - VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z) - NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models [17.495162643127003]
We introduce NavGPT to reveal the reasoning capability of GPT models in complex embodied scenes.
NavGPT takes the textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status.
We show that NavGPT is capable of generating high-quality navigational instructions from observations and actions along a path.
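The NavGPT summary above is concrete enough about its inputs (textual observations, navigation history, explorable directions) to sketch a prompt builder. The wording, field layout, and function name below are illustrative guesses, not the NavGPT authors' actual template.

```python
def build_navgpt_style_prompt(observation_text: str,
                              history: list[str],
                              candidate_directions: list[str],
                              goal: str) -> str:
    """Assemble a single text prompt from the three input streams the summary
    describes; a hypothetical sketch, not the paper's exact format."""
    history_block = "\n".join(f"Step {i}: {h}" for i, h in enumerate(history, 1))
    options_block = "\n".join(f"[{i}] {d}" for i, d in enumerate(candidate_directions))
    return (
        f"Goal: {goal}\n"
        f"Current observation:\n{observation_text}\n"
        f"Navigation history:\n{history_block or '(none)'}\n"
        f"Explorable directions:\n{options_block}\n"
        "Reason about the current status, then answer with the index of the "
        "direction to take next."
    )


# Example usage with placeholder content:
prompt = build_navgpt_style_prompt(
    observation_text="A wide street heading north; a tall clock tower ahead.",
    history=["moved forward along Main St"],
    candidate_directions=["continue north", "turn east onto 2nd Ave"],
    goal="Reach the plaza next to the clock tower.",
)
```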
arXiv Detail & Related papers (2023-05-26T14:41:06Z) - What do navigation agents learn about their environment? [39.74076893981299]
We introduce the Interpretability System for Embodied agEnts (iSEE) for Point Goal and Object Goal navigation agents.
We use iSEE to probe the dynamic representations produced by these agents for the presence of information about the agent as well as the environment.
arXiv Detail & Related papers (2022-06-17T01:33:43Z) - Teaching Agents how to Map: Spatial Reasoning for Multi-Object Navigation [11.868792440783055]
We show that learning to estimate metrics quantifying the spatial relationships between an agent at a given location and a goal to reach has a high positive impact in Multi-Object Navigation settings.
A learning-based agent from the literature trained with the proposed auxiliary losses was the winning entry to the Multi-Object Navigation Challenge.
arXiv Detail & Related papers (2021-07-13T12:01:05Z) - Diagnosing Vision-and-Language Navigation: What Really Matters [61.72935815656582]
Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments.
Recent studies have witnessed a slowdown in performance improvements on both indoor and outdoor VLN tasks.
In this work, we conduct a series of diagnostic experiments to unveil agents' focus during navigation.
arXiv Detail & Related papers (2021-03-30T17:59:07Z) - Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a structured memory architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z) - Occupancy Anticipation for Efficient Exploration and Navigation [97.17517060585875]
We propose occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions.
By exploiting context in both the egocentric views and top-down maps, our model successfully anticipates a broader map of the environment.
Our approach is the winning entry in the 2020 Habitat PointNav Challenge.
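As a rough illustration of the occupancy anticipation idea summarized above: a small convolutional model can take a partially observed local top-down grid (as would be projected from egocentric RGB-D) and predict occupancy for cells beyond the visible region. The architecture, channel layout, and grid size below are arbitrary placeholders, not the authors' model.

```python
import torch
import torch.nn as nn


class OccupancyAnticipator(nn.Module):
    """Toy sketch: predict (occupied, explored) logits for every cell of a
    local top-down grid from the partially observed grid."""

    def __init__(self, in_channels: int = 2, hidden: int = 32):
        super().__init__()
        # in_channels = 2: observed-occupancy and explored-mask channels
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 2, kernel_size=1),  # per-cell (occupied, explored) logits
        )

    def forward(self, visible_map: torch.Tensor) -> torch.Tensor:
        return self.net(visible_map)


# Example: anticipate a 64x64 local map from a partially observed one.
model = OccupancyAnticipator()
anticipated_logits = model(torch.zeros(1, 2, 64, 64))  # shape (1, 2, 64, 64)
```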
arXiv Detail & Related papers (2020-08-21T03:16:51Z) - Active Visual Information Gathering for Vision-Language Navigation [115.40768457718325]
Vision-language navigation (VLN) is the task in which an agent carries out navigation instructions inside photo-realistic environments.
One of the key challenges in VLN is how to conduct a robust navigation by mitigating the uncertainty caused by ambiguous instructions and insufficient observation of the environment.
This work draws inspiration from human navigation behavior and endows an agent with an active information gathering ability for a more intelligent VLN policy.
arXiv Detail & Related papers (2020-07-15T23:54:20Z) - Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in AI2-THOR, show that our model outperforms the baselines on both the SR and SPL metrics.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.