Vision and Language Navigation in the Real World via Online Visual
Language Mapping
- URL: http://arxiv.org/abs/2310.10822v1
- Date: Mon, 16 Oct 2023 20:44:09 GMT
- Title: Vision and Language Navigation in the Real World via Online Visual
Language Mapping
- Authors: Chengguang Xu, Hieu T. Nguyen, Christopher Amato, Lawson L.S. Wong
- Abstract summary: Vision-and-language navigation (VLN) methods are mainly evaluated in simulation.
We propose a novel framework to address the VLN task in the real world.
We evaluate the proposed pipeline on an Interbotix LoCoBot WX250 in an unseen lab environment.
- Score: 18.769171505280127
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Navigating in unseen environments is crucial for mobile robots. Enhancing
them with the ability to follow instructions in natural language will further
improve navigation efficiency in unseen cases. However, state-of-the-art (SOTA)
vision-and-language navigation (VLN) methods are mainly evaluated in
simulation, neglecting the complex and noisy real world. Directly transferring
SOTA navigation policies trained in simulation to the real world is challenging
due to the visual domain gap and the absence of prior knowledge about unseen
environments. In this work, we propose a novel navigation framework to address
the VLN task in the real world. Utilizing powerful foundation models, the
proposed framework includes four key components: (1) an LLM-based instruction
parser that converts the language instruction into a sequence of pre-defined
macro-action descriptions, (2) an online visual-language mapper that builds a
real-time visual-language map to maintain a spatial and semantic understanding
of the unseen environment, (3) a language indexing-based localizer that grounds
each macro-action description into a waypoint location on the map, and (4) a
DD-PPO-based local controller that predicts the next action. We evaluate the
proposed pipeline on an Interbotix LoCoBot WX250 in an unseen lab environment.
Without any fine-tuning, our pipeline significantly outperforms the SOTA VLN
baseline in the real world.
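The abstract describes a four-stage pipeline (instruction parsing, online visual-language mapping, language-indexed localization, and local control). The sketch below illustrates, under stated assumptions, how such stages might be wired together; every class, method, and value is a hypothetical placeholder rather than the authors' implementation, with each foundation-model call stubbed out so the control flow runs end to end.

```python
"""Hypothetical sketch of the four-stage VLN pipeline described in the abstract.

All names (InstructionParser, VisualLanguageMapper, Localizer, LocalController)
are illustrative placeholders, not the authors' code. Each stage is stubbed so
the overall control flow is runnable end to end.
"""
from dataclasses import dataclass


@dataclass
class Waypoint:
    x: float
    y: float


class InstructionParser:
    """(1) LLM-based parser: instruction -> sequence of macro-action descriptions."""

    def parse(self, instruction: str) -> list[str]:
        # Placeholder: a real system would prompt an LLM here.
        return [part.strip() for part in instruction.split(",") if part.strip()]


class VisualLanguageMapper:
    """(2) Online mapper: fuses observations into a visual-language map."""

    def __init__(self) -> None:
        self.map_features = {}  # e.g. map cell -> language-aligned feature

    def update(self, observation) -> None:
        # Placeholder: a real mapper would project image features into a
        # top-down map; here we simply record the observation.
        self.map_features[len(self.map_features)] = observation


class Localizer:
    """(3) Language-indexing localizer: macro-action description -> waypoint."""

    def ground(self, description: str, mapper: VisualLanguageMapper) -> Waypoint:
        # Placeholder: a real localizer would score map cells against the
        # description; here we return a fixed dummy waypoint.
        return Waypoint(x=1.0, y=0.0)


class LocalController:
    """(4) DD-PPO-style local controller: waypoint -> low-level action."""

    def act(self, waypoint: Waypoint) -> str:
        # Placeholder: a trained point-goal policy would run here.
        return "move_forward"


def run_episode(instruction: str, observations) -> list[str]:
    parser, mapper = InstructionParser(), VisualLanguageMapper()
    localizer, controller = Localizer(), LocalController()

    actions = []
    macro_actions = parser.parse(instruction)          # stage (1)
    # One low-level action per macro-action for brevity; a real controller
    # would loop until the waypoint is reached.
    for step, obs in zip(macro_actions, observations):
        mapper.update(obs)                             # stage (2)
        waypoint = localizer.ground(step, mapper)      # stage (3)
        actions.append(controller.act(waypoint))       # stage (4)
    return actions


if __name__ == "__main__":
    print(run_episode("go to the door, turn left, stop at the desk",
                      observations=["rgb_0", "rgb_1", "rgb_2"]))
```

In the actual system, the parser would prompt an LLM, the mapper would fuse image features into a metric map, the localizer would match macro-action descriptions against map features, and the controller would be a trained DD-PPO point-goal policy; the stubs above only mirror the data flow between those four components.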
Related papers
- NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning
Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), which performs parameter-efficient in-domain training to enable self-guided navigational decision-making.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation [73.78984332354636]
CorNav is a novel zero-shot framework for vision-and-language navigation.
It incorporates environmental feedback for refining future plans and adjusting its actions.
It consistently outperforms all baselines in a zero-shot multi-task setting.
arXiv Detail & Related papers (2023-06-17T11:44:04Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on large, unannotated datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in the global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
- Learning Synthetic to Real Transfer for Localization and Navigational Tasks [7.019683407682642]
Navigation lies at the crossroads of multiple disciplines: it combines notions from computer vision, robotics, and control.
This work aims to create, in simulation, a navigation pipeline that can be transferred to the real world with as little effort as possible.
Designing the navigation pipeline raises four main challenges: environment, localization, navigation, and planning.
arXiv Detail & Related papers (2020-11-20T08:37:03Z)