AerialVLN: Vision-and-Language Navigation for UAVs
- URL: http://arxiv.org/abs/2308.06735v1
- Date: Sun, 13 Aug 2023 09:55:04 GMT
- Title: AerialVLN: Vision-and-Language Navigation for UAVs
- Authors: Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Yaning Zhang, Qi Wu
- Abstract summary: We propose a new task named AerialVLN, which is UAV-based and targets outdoor environments.
We develop a 3D simulator rendered with near-realistic imagery of 25 city-level scenarios.
We find that there is still a significant gap between the baseline model and human performance, which suggests AerialVLN is a new challenging task.
- Score: 23.40363176320464
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently emerged Vision-and-Language Navigation (VLN) tasks have drawn
significant attention in both computer vision and natural language processing
communities. Existing VLN tasks are built for agents that navigate on the
ground, either indoors or outdoors. However, many tasks require intelligent
agents to operate in the sky, such as UAV-based goods delivery,
traffic/security patrol, and scenic tours, to name a few. Navigating in the sky
is more complicated than on the ground because agents need to consider the
flying height and more complex spatial relationship reasoning. To fill this gap
and facilitate research in this field, we propose a new task named AerialVLN,
which is UAV-based and targets outdoor environments. We develop a 3D simulator
rendered with near-realistic imagery of 25 city-level scenarios. Our simulator
supports continuous navigation, environment extension and configuration. We
also propose an extended baseline model based on the widely-used
cross-modal-alignment (CMA) navigation methods. We find that there is still a
significant gap between the baseline model and human performance, which
suggests AerialVLN is a challenging new task. The dataset and code are available at
https://github.com/AirVLN/AirVLN.
Related papers
- UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation [71.97405667493477]
We introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN.
It enables agents to better explore future environments by unitedly rendering high-fidelity 360 visual images and semantic features.
UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
arXiv Detail & Related papers (2024-11-25T02:44:59Z)
- NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation [15.628308089720269]
Vision-and-Language Navigation (VLN) aims to enable embodied agents to navigate in complicated visual environments through natural language commands.
We propose NavAgent, the first urban UAV embodied navigation model driven by a large Vision-Language Model.
We build a landmark visual recognizer capable of identifying and linguisticizing fine-grained landmarks.
To train the landmark visual recognizer, we develop NavAgent-Landmark2K, the first fine-grained landmark dataset for real urban street scenes.
arXiv Detail & Related papers (2024-11-13T12:51:49Z)
- Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology [38.2096731046639]
Recent efforts in UAV vision-language navigation predominantly adopt ground-based VLN settings.
We propose solutions from three perspectives: platform, benchmark, and methodology.
arXiv Detail & Related papers (2024-10-09T17:29:01Z)
- Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs [95.8010627763483]
Mobility VLA is a hierarchical Vision-Language-Action (VLA) navigation policy that combines the environment understanding and common sense reasoning power of long-context VLMs.
We show that Mobility VLA has high end-to-end success rates on previously unsolved multimodal instructions.
arXiv Detail & Related papers (2024-07-10T15:49:07Z)
- NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation [23.72290930234063]
NaVid is a video-based large vision language model (VLM) for vision-and-language navigation.
NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer.
arXiv Detail & Related papers (2024-02-24T16:39:16Z)
- SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments [14.179677726976056]
SayNav is a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigation tasks.
SayNav achieves state-of-the-art results and even outperforms an oracle-based baseline with strong ground-truth assumptions by more than 8% in terms of success rate.
arXiv Detail & Related papers (2023-09-08T02:24:37Z)
- ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets.
arXiv Detail & Related papers (2023-04-06T13:07:17Z)
- Aerial View Goal Localization with Reinforcement Learning [6.165163123577484]
We present a framework that emulates a search-and-rescue (SAR)-like setup without requiring access to actual UAVs.
In this framework, an agent operates on top of an aerial image (proxy for a search area) and is tasked with localizing a goal that is described in terms of visual cues.
We propose AiRLoc, a reinforcement learning (RL)-based model that decouples exploration (searching for distant goals) and exploitation (localizing nearby goals).
arXiv Detail & Related papers (2022-09-08T10:27:53Z)
- ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments [85.81157224163876]
We combine Vision-and-Language Navigation, assembling of collected objects, and object referring expression comprehension, to create a novel joint navigation-and-assembly task, named ArraMon.
During this task, the agent is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment.
We present results for several baseline models (integrated and biased) and metrics (nDTW, CTC, rPOD, and PTC), and the large model-human performance gap demonstrates that our task is challenging and presents a wide scope for future work.
arXiv Detail & Related papers (2020-11-15T23:30:36Z)
- Sim-to-Real Transfer for Vision-and-Language Navigation [70.86250473583354]
We study the problem of releasing a robot in a previously unseen environment, and having it follow unconstrained natural language navigation instructions.
Recent work on the task of Vision-and-Language Navigation (VLN) has achieved significant progress in simulation.
To assess the implications of this work for robotics, we transfer a VLN agent trained in simulation to a physical robot.
arXiv Detail & Related papers (2020-11-07T16:49:04Z)
- Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in AI2-THOR, show that our model outperforms the baselines in both SR and SPL metrics.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)