Related papers: Sim-to-Real Transfer via 3D Feature Fields for Vision-and-Language Navigation

URL: http://arxiv.org/abs/2406.09798v3
Date: Mon, 14 Oct 2024 04:48:18 GMT
Title: Sim-to-Real Transfer via 3D Feature Fields for Vision-and-Language Navigation
Authors: Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Shuqiang Jiang
Abstract summary: Vision-and-language navigation (VLN) enables the agent to navigate to a remote location in 3D environments following the natural language instruction. In this work, we propose a sim-to-real transfer approach to endow the monocular robots with panoramic traversability perception and panoramic semantic understanding. Our VLN system outperforms previous SOTA monocular VLN methods in R2R-CE and RxR-CE benchmarks within the simulation environments and is also validated in real-world environments.
Score: 38.04404612393027
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-and-language navigation (VLN) enables the agent to navigate to a remote location in 3D environments following the natural language instruction. In this field, the agent is usually trained and evaluated in the navigation simulators, lacking effective approaches for sim-to-real transfer. The VLN agents with only a monocular camera exhibit extremely limited performance, while the mainstream VLN models trained with panoramic observation perform better but are difficult to deploy on most monocular robots. For this case, we propose a sim-to-real transfer approach to endow the monocular robots with panoramic traversability perception and panoramic semantic understanding, thus smoothly transferring the high-performance panoramic VLN models to the common monocular robots. In this work, the semantic traversable map is proposed to predict agent-centric navigable waypoints, and the novel view representations of these navigable waypoints are predicted through the 3D feature fields. These methods broaden the limited field of view of the monocular robots and significantly improve navigation performance in the real world. Our VLN system outperforms previous SOTA monocular VLN methods in R2R-CE and RxR-CE benchmarks within the simulation environments and is also validated in real-world environments, providing a practical and high-performance solution for real-world VLN.

Related papers

SkyVLN: Vision-and-Language Navigation and NMPC Control for UAVs in Urban Environments [7.251041314934871]
Unmanned Aerial Vehicles (UAVs) have emerged as versatile tools across various sectors, driven by their mobility and adaptability. This paper introduces SkyVLN, a novel framework integrating vision-and-language navigation (VLN) with Nonlinear Model Predictive Control (NMPC) to enhance UAV autonomy in complex urban environments.
arXiv Detail & Related papers (2025-07-09T05:38:32Z)
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [77.34267241692706]
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions.
arXiv Detail & Related papers (2025-06-20T17:59:59Z)
Ground-level Viewpoint Vision-and-Language Navigation in Continuous Environments [10.953629652228024]
Vision-and-Language Navigation (VLN) agents associate time-sequenced visual observations with corresponding instructions to make decisions. In this paper, we address the mismatch between human-centric instructions and quadruped robots with a low-height field of view. We propose a Ground-level Viewpoint Navigation (GVNav) approach to mitigate this issue.
arXiv Detail & Related papers (2025-02-26T10:30:40Z)
OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation [49.697035403548966]
Vision-Language Navigation (VLN) aims to guide agents by leveraging language instructions and visual cues, playing a pivotal role in embodied AI. We propose OpenFly, a platform comprising various rendering engines, a versatile toolchain, and a large-scale benchmark for aerial VLN. We construct a large-scale aerial VLN dataset with 100k trajectories, covering diverse heights and lengths across 18 scenes.
arXiv Detail & Related papers (2025-02-25T09:57:18Z)
AdaVLN: Towards Visual Language Navigation in Continuous Indoor Environments with Moving Humans [2.940962519388297]
We propose an extension to the task, termed Adaptive Visual Language Navigation (AdaVLN). AdaVLN requires robots to navigate complex 3D indoor environments populated with dynamically moving human obstacles. We evaluate several baseline models on this task, analyze the unique challenges introduced by AdaVLN, and demonstrate its potential to bridge the sim-to-real gap in VLN research.
arXiv Detail & Related papers (2024-11-27T17:36:08Z)
UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation [71.97405667493477]
We introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN. It enables agents to better explore future environments by unitedly rendering high-fidelity 360-degree visual images and semantic features. UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
arXiv Detail & Related papers (2024-11-25T02:44:59Z)
Enhancing Autonomous Navigation by Imaging Hidden Objects using Single-Photon LiDAR [12.183773707869069]
We present a novel approach that leverages Non-Line-of-Sight (NLOS) sensing using single-photon LiDAR to improve visibility and enhance autonomous navigation. Our method enables mobile robots to "see around corners" by utilizing multi-bounce light information.
arXiv Detail & Related papers (2024-10-04T16:03:13Z)
Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions [69.9980759344628]
Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. We introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities. We present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies.
arXiv Detail & Related papers (2024-06-27T15:01:42Z)
Gaussian Splatting to Real World Flight Navigation Transfer with Liquid Networks [93.38375271826202]
We present a method to improve generalization and robustness to distribution shifts in sim-to-real visual quadrotor navigation tasks. We first build a simulator by integrating Gaussian splatting with quadrotor flight dynamics, and then train robust navigation policies using Liquid neural networks. In this way, we obtain a full-stack imitation learning protocol that combines advances in 3D Gaussian splatting radiance field rendering, programming of expert demonstration training data, and the task understanding capabilities of Liquid networks.
arXiv Detail & Related papers (2024-06-21T13:48:37Z)
CityNav: A Large-Scale Dataset for Real-World Aerial Navigation [25.51740922661166]
We introduce CityNav, the first large-scale real-world dataset for aerial VLN. Our dataset consists of 32,637 human demonstration trajectories, each paired with a natural language description. We provide a methodology of creating geographic semantic maps that can be used as an auxiliary modality input during navigation.
arXiv Detail & Related papers (2024-06-20T12:08:27Z)
Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions. We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells. VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
Image-based Navigation in Real-World Environments via Multiple Mid-level Representations: Fusion Models, Benchmark and Efficient Evaluation [13.207579081178716]
In recent learning-based navigation approaches, the scene understanding and navigation abilities of the agent are achieved simultaneously. Unfortunately, even if simulators represent an efficient tool to train navigation policies, the resulting models often fail when transferred into the real world. One possible solution is to provide the navigation model with mid-level visual representations containing important domain-invariant properties of the scene.
arXiv Detail & Related papers (2022-02-02T15:00:44Z)
On Embodied Visual Navigation in Real Environments Through Habitat [20.630139085937586]
Visual navigation models based on deep learning can learn effective policies when trained on large amounts of visual observations. To deal with this limitation, several simulation platforms have been proposed in order to train visual navigation policies on virtual environments efficiently. We show that our tool can effectively help to train and evaluate navigation policies on real-world observations without running navigation episodes in the real world.
arXiv Detail & Related papers (2020-10-26T09:19:07Z)
Transferable Active Grasping and Real Embodied Dataset [48.887567134129306]
We show how to search for feasible viewpoints for grasping by the use of hand-mounted RGB-D cameras. A practical 3-stage transferable active grasping pipeline is developed that is adaptive to unseen clutter scenes. In our pipeline, we propose a novel mask-guided reward to overcome the sparse reward issue in grasping and ensure category-irrelevant behavior.
arXiv Detail & Related papers (2020-04-28T08:15:35Z)
Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling [65.99956848461915]
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal. One of the problems of the VLN task is data scarcity since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments. We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data.
arXiv Detail & Related papers (2019-11-17T18:02:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.