PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2503.09938v1
- Date: Thu, 13 Mar 2025 01:16:58 GMT
- Authors: Sen Wang, Dongliang Zhou, Liang Xie, Chao Xu, Ye Yan, Erwei Yin
- Abstract summary: PanoGen++ is a framework that generates varied and pertinent panoramic environments for vision-and-language navigation tasks. It incorporates pre-trained diffusion models with domain-specific fine-tuning, employing parameter-efficient techniques such as low-rank adaptation to minimize computational costs. PanoGen++ augments the diversity and relevance of training environments, resulting in improved generalization and efficacy in VLN tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-language navigation (VLN) tasks require agents to navigate three-dimensional environments guided by natural language instructions, offering substantial potential for diverse applications. However, the scarcity of training data impedes progress in this field. This paper introduces PanoGen++, a novel framework that addresses this limitation by generating varied and pertinent panoramic environments for VLN tasks. PanoGen++ incorporates pre-trained diffusion models with domain-specific fine-tuning, employing parameter-efficient techniques such as low-rank adaptation to minimize computational costs. We investigate two settings for environment generation: masked image inpainting and recursive image outpainting. The former maximizes novel environment creation by inpainting masked regions based on textual descriptions, while the latter facilitates agents' learning of spatial relationships within panoramas. Empirical evaluations on room-to-room (R2R), room-for-room (R4R), and cooperative vision-and-dialog navigation (CVDN) datasets reveal significant performance gains: a 2.44% increase in success rate on the R2R test leaderboard, a 0.63% improvement on the R4R validation unseen set, and a 0.75-meter improvement in goal progress on the CVDN validation unseen set. PanoGen++ augments the diversity and relevance of training environments, resulting in improved generalization and efficacy in VLN tasks.
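The abstract attributes the framework's low fine-tuning cost to low-rank adaptation (LoRA). As a rough illustration of that parameter-efficient idea (a generic PyTorch sketch, not the authors' code; the layer sizes are placeholders), a frozen pre-trained weight is augmented with a small trainable low-rank residual:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W x + (alpha / r) * B(A(x)).

    The base weight is frozen; only the low-rank factors A and B train,
    so trainable parameters drop from d_out*d_in to r*(d_in + d_out).
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze the pre-trained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.01)
        nn.init.zeros_(self.lora_b.weight)    # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: adapt one projection of a (stand-in) pre-trained diffusion layer.
layer = nn.Linear(768, 768)
adapted = LoRALinear(layer, r=8, alpha=16.0)
print(adapted(torch.randn(4, 768)).shape)     # torch.Size([4, 768])
```

Only the two small factor matrices receive gradients, which is what keeps domain-specific fine-tuning of a large diffusion model cheap.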
Related papers
- General Scene Adaptation for Vision-and-Language Navigation [19.215183093931785]
Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on one-time execution of individual instructions across multiple environments. We introduce GSA-VLN, a novel task requiring agents to execute navigation instructions within a specific scene and simultaneously adapt to it for improved performance over time. We propose a new dataset, GSA-R2R, which significantly expands the diversity and quantity of environments and instructions for the R2R dataset.
arXiv Detail & Related papers (2025-01-29T03:57:56Z)
- World-Consistent Data Generation for Vision-and-Language Navigation [52.08816337783936]
Vision-and-Language Navigation (VLN) is a challenging task that requires an agent to navigate through photorealistic environments following natural-language instructions. One main obstacle in VLN is data scarcity, which leads to poor generalization over unseen environments. We propose world-consistent data generation (WCGEN), an effective data-augmentation framework that satisfies both diversity and world-consistency.
arXiv Detail & Related papers (2024-12-09T11:40:54Z)
- LiOn-XA: Unsupervised Domain Adaptation via LiDAR-Only Cross-Modal Adversarial Training [61.26381389532653]
LiOn-XA is an unsupervised domain adaptation (UDA) approach that combines LiDAR-Only Cross-Modal (X) learning with Adversarial training for 3D LiDAR point cloud semantic segmentation.
Our experiments on 3 real-to-real adaptation scenarios demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-10-21T09:50:17Z)
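The adversarial-training ingredient named in the entry above is commonly realized with a domain discriminator trained through a gradient reversal layer; the sketch below shows that generic recipe in PyTorch (the feature extractor, discriminator, and loss weighting are illustrative assumptions, not LiOn-XA's actual architecture):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None   # no gradient for lam itself

# Stand-in feature extractor and domain discriminator.
features = nn.Sequential(nn.Linear(64, 128), nn.ReLU())
discriminator = nn.Sequential(nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

x_src, x_tgt = torch.randn(8, 64), torch.randn(8, 64)
f = features(torch.cat([x_src, x_tgt]))
d = discriminator(GradReverse.apply(f, 1.0))
labels = torch.cat([torch.ones(8, 1), torch.zeros(8, 1)])  # 1 = source domain
loss = bce(d, labels)
loss.backward()  # D learns to tell domains apart; features learn to confuse D
```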
- Fast Context-Based Low-Light Image Enhancement via Neural Implicit Representations [6.113035634680655]
Current deep learning-based low-light image enhancement methods often struggle with high-resolution images.
We introduce a novel approach termed CoLIE, which redefines the enhancement process by mapping the 2D coordinates of an underexposed image to its illumination component.
arXiv Detail & Related papers (2024-07-17T11:51:52Z)
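Per the abstract above, CoLIE fits a neural implicit representation from pixel coordinates to an illumination component; because the network is queried per coordinate, image resolution is decoupled from model size. A generic coordinate-MLP sketch (the layer widths, activations, and output range are our assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

# Generic implicit model: normalized (x, y) coordinate -> illumination value.
mlp = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),   # keep illumination in (0, 1)
)

h, w = 32, 48                          # any resolution; the model stays fixed-size
ys, xs = torch.meshgrid(
    torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij"
)
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)  # one (x, y) row per pixel
illumination = mlp(coords).reshape(h, w)
print(illumination.shape)              # torch.Size([32, 48])
```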
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
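VER's voxelization step can be pictured as binning observed 3D points into a fixed grid of cells and predicting per-cell quantities such as occupancy. A minimal sketch of that binning (the grid resolution, scene bounds, and synthetic points are illustrative assumptions):

```python
import torch

def voxelize(points: torch.Tensor, bounds, grid=(32, 32, 16)) -> torch.Tensor:
    """Map (N, 3) world-coordinate points to a boolean occupancy grid."""
    lo = torch.tensor(bounds[0], dtype=points.dtype)
    hi = torch.tensor(bounds[1], dtype=points.dtype)
    size = torch.tensor(grid)
    # Normalize into [0, 1), scale to cell indices, and clamp to the grid.
    idx = ((points - lo) / (hi - lo) * size).long()
    idx = torch.maximum(idx, torch.zeros(3, dtype=torch.long))
    idx = torch.minimum(idx, size - 1)
    occ = torch.zeros(grid, dtype=torch.bool)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occ

pts = torch.rand(1000, 3) * 10.0 - 5.0   # synthetic point cloud in a 10 m cube
occupancy = voxelize(pts, bounds=((-5, -5, -5), (5, 5, 5)))
print(occupancy.sum().item(), "occupied cells")
```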
- PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation [96.8435716885159]
Vision-and-Language Navigation (VLN) requires the agent to follow language instructions to navigate through 3D environments.
One main challenge in VLN is the limited availability of training environments, which makes it hard to generalize to new and unseen environments.
We propose PanoGen, a generation method that can potentially create an infinite number of diverse panoramic environments conditioned on text.
arXiv Detail & Related papers (2023-05-30T16:39:54Z)
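PanoGen's recursive generation can be approximated with an off-the-shelf inpainting diffusion pipeline: keep a strip of the current panorama as context, mask the blank continuation, and repeat. The sketch below uses Hugging Face diffusers that way (the checkpoint, strip widths, and stitching are illustrative assumptions, not PanoGen's published recipe):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Requires a GPU; the checkpoint is an illustrative choice.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def outpaint_right(panorama: Image.Image, prompt: str, step: int = 256) -> Image.Image:
    """Extend the panorama rightward by `step` pixels: reuse its right edge
    as visual context and let the model fill the masked blank strip."""
    w, h = panorama.size                        # h is assumed to be 512 here
    canvas = Image.new("RGB", (512, 512))
    canvas.paste(panorama.crop((w - (512 - step), 0, w, h)), (0, 0))
    mask = Image.new("L", (512, 512), 0)        # black = keep
    mask.paste(255, (512 - step, 0, 512, 512))  # white = generate
    filled = pipe(prompt=prompt, image=canvas, mask_image=mask).images[0]
    grown = Image.new("RGB", (w + step, h))
    grown.paste(panorama, (0, 0))
    grown.paste(filled.crop((512 - step, 0, 512, 512)), (w, 0))
    return grown

view = Image.new("RGB", (512, 512), "gray")     # stand-in for an initial view
for _ in range(4):                              # recursion widens the panorama
    view = outpaint_right(view, "a sunlit modern living room, photorealistic")
```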
- A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning [70.14372215250535]
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments.
Given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding.
We take 500+ indoor environments captured in densely sampled 360-degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory.
The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets.
arXiv Detail & Related papers (2022-10-06T17:59:08Z)
- Airbert: In-domain Pretraining for Vision-and-Language Navigation [91.03849833486974]
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions.
Recent methods explore pretraining to improve generalization of VLN agents.
We introduce BnB, a large-scale and diverse in-domain VLN dataset.
arXiv Detail & Related papers (2021-08-20T10:58:09Z)