NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning
Disentangled Reasoning
- URL: http://arxiv.org/abs/2403.07376v1
- Date: Tue, 12 Mar 2024 07:27:02 GMT
- Title: NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning
Disentangled Reasoning
- Authors: Bingqian Lin, Yunshuang Nie, Ziming Wei, Jiaqi Chen, Shikui Ma,
Jianhua Han, Hang Xu, Xiaojun Chang, Xiaodan Liang
- Abstract summary: Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), in which we perform parameter-efficient in-domain training to enable self-guided navigational decision-making.
- Score: 101.56342075720588
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Navigation (VLN), as a crucial research problem of
Embodied AI, requires an embodied agent to navigate through complex 3D
environments following natural language instructions. Recent research has
highlighted the promising capacity of large language models (LLMs) in VLN by
improving navigational reasoning accuracy and interpretability. However, their
predominant use in an offline manner usually suffers from a substantial domain
gap between the VLN task and the LLM training corpus. This paper introduces a
novel strategy called Navigational Chain-of-Thought (NavCoT), in which we perform
parameter-efficient in-domain training to enable self-guided navigational
decision-making, significantly mitigating the domain gap in a
cost-effective manner. Specifically, at each timestep, the LLM is prompted to
forecast the navigational chain-of-thought by: 1) acting as a world model to
imagine the next observation according to the instruction, 2) selecting the
candidate observation that best aligns with the imagination, and 3) determining
the action based on the reasoning from the prior steps. Through constructing
formalized labels for training, the LLM can learn to generate desired and
reasonable chain-of-thought outputs for improving the action decision.
Experimental results across various training settings and popular VLN
benchmarks (e.g., Room-to-Room (R2R), Room-across-Room (RxR), Room-for-Room
(R4R)) show the significant superiority of NavCoT over the direct action
prediction variants. Through simple parameter-efficient finetuning, our NavCoT
outperforms a recent GPT-4-based approach by a ~7% relative improvement on the
R2R dataset. We believe that NavCoT will help unlock more task-adaptive and
scalable LLM-based embodied agents, which are helpful for developing real-world
robotics applications. Code is available at
https://github.com/expectorlin/NavCoT.
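To make the three-step reasoning above concrete, here is a minimal Python sketch of how such a prompt and decision step could be assembled at a single timestep. The prompt wording, function names, and the `query_llm` callable are illustrative assumptions rather than the authors' actual prompts or code; the real implementation is in the linked repository.

```python
import re
from typing import Callable, List


def build_navcot_prompt(instruction: str, history: List[str], candidates: List[str]) -> str:
    """Compose a single-timestep prompt that asks the LLM to reason in three steps:
    imagine the next observation (world model), match the imagination against the
    candidate observations, and commit to an action index."""
    candidate_lines = "\n".join(f"({i}) {c}" for i, c in enumerate(candidates))
    return (
        f"Instruction: {instruction}\n"
        f"History: {' -> '.join(history) if history else 'none'}\n"
        f"Candidate observations:\n{candidate_lines}\n"
        "Step 1 (Imagination): describe the observation you expect to see next.\n"
        "Step 2 (Matching): state which candidate best matches that imagination.\n"
        "Step 3 (Action): output only the index of the chosen candidate."
    )


def navcot_step(query_llm: Callable[[str], str],
                instruction: str,
                history: List[str],
                candidates: List[str]) -> int:
    """Run one decision step and parse the chosen action index from the reply.

    `query_llm` is any text-in/text-out LLM interface (e.g., a locally finetuned
    model); the parsing assumes the model follows the three-step output format."""
    reply = query_llm(build_navcot_prompt(instruction, history, candidates))
    lines = [line for line in reply.strip().splitlines() if line.strip()]
    last_line = lines[-1] if lines else ""
    numbers = re.findall(r"\d+", last_line)
    index = int(numbers[-1]) if numbers else 0  # fall back to candidate 0 if parsing fails
    return min(index, max(len(candidates) - 1, 0))  # clamp to the candidate range
```

In this sketch the three reasoning steps are kept in a single prompt, so a trained model can be supervised with formalized chain-of-thought labels in the spirit of the training setup described in the abstract; the candidate descriptions and history strings would come from whatever observation-to-text pipeline the agent uses.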
Related papers
- Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs [41.90732562248243]
Vision-and-Language Navigation (VLN) tasks require an agent to follow textual instructions to navigate through 3D environments.
Recent methods try to utilize closed-source large language models (LLMs) to solve VLN tasks in a zero-shot manner.
We introduce Open-Nav, a novel study that explores open-source LLMs for zero-shot VLN in the continuous environment.
arXiv Detail & Related papers (2024-09-27T14:47:18Z)
- Correctable Landmark Discovery via Large Models for Vision-Language Navigation [89.15243018016211]
Vision-Language Navigation (VLN) requires the agent to follow language instructions to reach a target position.
Previous VLN agents fail to perform accurate modality alignment, especially in unexplored scenes.
We propose a new VLN paradigm, called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE).
arXiv Detail & Related papers (2024-05-29T03:05:59Z)
- MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains [4.941781282578696]
In the Vision-and-Language Navigation (VLN) task, the agent is required to navigate to a destination following a natural language instruction.
While learning-based approaches have been a major solution to the task, they suffer from high training costs and lack of interpretability.
Recently, Large Language Models (LLMs) have emerged as a promising tool for VLN due to their strong generalization capabilities.
arXiv Detail & Related papers (2024-05-17T08:33:27Z)
- TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation [11.591176410027224]
This paper presents a Vision-Language Navigation (VLN) agent based on Large Language Models (LLMs).
We propose the Thinking, Interacting, and Action framework to compensate for the shortcomings of LLMs in environmental perception.
Our approach also outperformed some supervised learning-based methods, highlighting its efficacy in zero-shot navigation.
arXiv Detail & Related papers (2024-03-13T05:22:39Z)
- Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive Learning [125.61772424068903]
Vision-and-language navigation (VLN) asks an agent to follow a given language instruction to navigate through a real 3D environment.
We present a model-agnostic training paradigm, called Progressive Perturbation-aware Contrastive Learning (PROPER), to enhance the generalization ability of existing VLN agents.
arXiv Detail & Related papers (2024-03-09T02:34:13Z)
- Mind the Gap: Improving Success Rate of Vision-and-Language Navigation by Revisiting Oracle Success Routes [25.944819618283613]
Vision-and-Language Navigation (VLN) aims to navigate to the target location by following a given instruction.
We make the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR).
arXiv Detail & Related papers (2023-08-07T01:43:25Z)
- NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models [17.495162643127003]
We introduce NavGPT to reveal the reasoning capability of GPT models in complex embodied scenes.
NavGPT takes textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status.
We show that NavGPT is capable of generating high-quality navigational instructions from observations and actions along a path.
arXiv Detail & Related papers (2023-05-26T14:41:06Z)
- ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets.
arXiv Detail & Related papers (2023-04-06T13:07:17Z)
- ULN: Towards Underspecified Vision-and-Language Navigation [77.81257404252132]
Underspecified Vision-and-Language Navigation (ULN) is a new setting for Vision-and-Language Navigation (VLN).
We propose a VLN framework that consists of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module.
Our framework is more robust and outperforms the baselines on ULN by 10% relative success rate across all levels.
arXiv Detail & Related papers (2022-10-18T17:45:06Z)
- Airbert: In-domain Pretraining for Vision-and-Language Navigation [91.03849833486974]
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions.
Recent methods explore pretraining to improve generalization of VLN agents.
We introduce BnB, a large-scale and diverse in-domain VLN dataset.
arXiv Detail & Related papers (2021-08-20T10:58:09Z)