Learning to Stop: A Simple yet Effective Approach to Urban
Vision-Language Navigation
- URL: http://arxiv.org/abs/2009.13112v3
- Date: Sun, 18 Oct 2020 05:41:06 GMT
- Title: Learning to Stop: A Simple yet Effective Approach to Urban
Vision-Language Navigation
- Authors: Jiannan Xiang, Xin Eric Wang, William Yang Wang
- Abstract summary: We propose Learning to Stop (L2Stop), a simple yet effective policy module that differentiates STOP and other actions.
Our approach achieves a new state of the art on the challenging urban VLN dataset Touchdown, outperforming the baseline by 6.89% (absolute improvement) on Success weighted by Edit Distance (SED).
- Score: 82.85487869172854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Navigation (VLN) is a natural language grounding task
where an agent learns to follow language instructions and navigate to specified
destinations in real-world environments. A key challenge is to recognize and
stop at the correct location, especially for complicated outdoor environments.
Existing methods treat the STOP action the same as other actions, which leads
to undesirable behavior: the agent often fails to stop at the destination
even though it might be on the right path. Therefore, we propose Learning to
Stop (L2Stop), a simple yet effective policy module that differentiates STOP
and other actions. Our approach achieves a new state of the art on the
challenging urban VLN dataset Touchdown, outperforming the baseline by 6.89%
(absolute improvement) on Success weighted by Edit Distance (SED).
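As a concrete illustration of the idea, below is a minimal PyTorch sketch of a policy that treats STOP differently from the other actions by giving it a dedicated prediction head, together with a rough SED-style scoring helper. This is not the authors' implementation: the module names, feature sizes, action inventory, and the edit-distance normalization are all assumptions made for illustration only.

```python
# Minimal sketch of the L2Stop idea: decouple the STOP decision from the
# choice among navigation actions. Names and sizes are illustrative, not the
# authors' code.
import torch
import torch.nn as nn


class StopAwarePolicy(nn.Module):
    """Two-headed policy: one head decides whether to STOP, the other
    selects among the remaining navigation actions (e.g. forward, left, right)."""

    def __init__(self, hidden_size: int = 256, num_nav_actions: int = 3):
        super().__init__()
        self.stop_head = nn.Linear(hidden_size, 2)            # stop vs. continue
        self.nav_head = nn.Linear(hidden_size, num_nav_actions)

    def forward(self, state: torch.Tensor):
        stop_logits = self.stop_head(state)   # (batch, 2)
        nav_logits = self.nav_head(state)     # (batch, num_nav_actions)
        return stop_logits, nav_logits

    def act(self, state: torch.Tensor) -> torch.Tensor:
        """Greedy decoding: consult the stop head first; only when it says
        'continue' do we pick a navigation action (index 0 is reserved for STOP)."""
        stop_logits, nav_logits = self.forward(state)
        should_stop = stop_logits.argmax(dim=-1) == 1
        nav_action = nav_logits.argmax(dim=-1) + 1
        return torch.where(should_stop, torch.zeros_like(nav_action), nav_action)


def success_weighted_by_edit_distance(pred_path, ref_path, success: bool) -> float:
    """Rough sketch of an SED-style score: task success weighted by one minus
    the normalized Levenshtein distance between the predicted and reference
    paths. The exact normalization used on Touchdown may differ."""
    if not success:
        return 0.0
    m, n = len(pred_path), len(ref_path)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred_path[i - 1] == ref_path[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    return 1.0 - dist[m][n] / max(m, n, 1)
```

The point of the two-headed design is that, at decoding time, the STOP decision is made first and never competes with the navigation actions for probability mass, which matches the paper's motivation of treating STOP as a distinct decision.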
Related papers
- NL-SLAM for OC-VLN: Natural Language Grounded SLAM for Object-Centric VLN [8.788856156414026]
We present a new dataset, OC-VLN, in order to distinctly evaluate grounding object-centric natural language navigation instructions.
We also propose Natural Language grounded SLAM (NL-SLAM), a method to ground natural language instruction to robot observations and poses.
arXiv Detail & Related papers (2024-11-12T15:01:40Z)
- MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains [4.941781282578696]
In the Vision-and-Language Navigation (VLN) task, the agent is required to navigate to a destination following a natural language instruction.
While learning-based approaches have been a major solution to the task, they suffer from high training costs and lack of interpretability.
Recently, Large Language Models (LLMs) have emerged as a promising tool for VLN due to their strong generalization capabilities.
arXiv Detail & Related papers (2024-05-17T08:33:27Z)
- Vision and Language Navigation in the Real World via Online Visual Language Mapping [18.769171505280127]
Vision-and-language navigation (VLN) methods are mainly evaluated in simulation.
We propose a novel framework to address the VLN task in the real world.
We evaluate the proposed pipeline on an Interbotix LoCoBot WX250 in an unseen lab environment.
arXiv Detail & Related papers (2023-10-16T20:44:09Z)
- Pushing it out of the Way: Interactive Visual Navigation [62.296686176988125]
We study the problem of interactive navigation where agents learn to change the environment to navigate more efficiently to their goals.
We introduce the Neural Interaction Engine (NIE) to explicitly predict the change in the environment caused by the agent's actions.
By modeling the changes while planning, we find that agents exhibit significant improvements in their navigational capabilities.
arXiv Detail & Related papers (2021-04-28T22:46:41Z)
- Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning [66.9937776799536]
The emerging vision-and-language navigation (VLN) problem aims at learning to navigate an agent to the target location in unseen photo-realistic environments.
The main challenges of VLN arise from two aspects: first, the agent needs to attend to the meaningful parts of the language instruction that correspond to the dynamically varying visual environments.
We propose a cross-modal grounding module to equip the agent with a better ability to track the correspondence between the textual and visual modalities.
arXiv Detail & Related papers (2020-11-22T09:13:46Z)
- Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation [71.67507925788577]
This paper introduces a Multimodal Text Style Transfer (MTST) learning approach for outdoor navigation tasks.
We first enrich the navigation data by transferring the style of the instructions generated by Google Maps API, then pre-train the navigator with the augmented external navigation dataset.
Experimental results show that our MTST learning approach is model-agnostic and significantly outperforms the baseline models on the outdoor VLN task.
arXiv Detail & Related papers (2020-07-01T04:29:07Z)
- Environment-agnostic Multitask Learning for Natural Language Grounded Navigation [88.69873520186017]
We introduce a multitask navigation model that can be seamlessly trained on Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH) tasks.
Experiments show that environment-agnostic multitask learning significantly reduces the performance gap between seen and unseen environments.
arXiv Detail & Related papers (2020-03-01T09:06:31Z)
- Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better in a previously unseen environment.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)