PanoGen: Text-Conditioned Panoramic Environment Generation for
Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2305.19195v1
- Date: Tue, 30 May 2023 16:39:54 GMT
- Title: PanoGen: Text-Conditioned Panoramic Environment Generation for
Vision-and-Language Navigation
- Authors: Jialu Li, Mohit Bansal
- Abstract summary: Vision-and-Language Navigation (VLN) requires the agent to follow language instructions to navigate through 3D environments.
One main challenge in VLN is the limited availability of training environments, which makes it hard to generalize to new and unseen environments.
We propose PanoGen, a generation method that can potentially create an infinite number of diverse panoramic environments conditioned on text.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-and-Language Navigation (VLN) requires the agent to follow language
instructions to navigate through 3D environments. One main challenge in VLN is
the limited availability of photorealistic training environments, which makes
it hard to generalize to new and unseen environments. To address this problem,
we propose PanoGen, a generation method that can potentially create an infinite
number of diverse panoramic environments conditioned on text. Specifically, we
collect room descriptions by captioning the room images in existing
Matterport3D environments, and leverage a state-of-the-art text-to-image
diffusion model to generate the new panoramic environments. We use recursive
outpainting over the generated images to create consistent 360-degree panorama
views. Our new panoramic environments share similar semantic information with
the original environments by conditioning on text descriptions, which ensures
the co-occurrence of objects in the panorama follows human intuition, and
creates enough diversity in room appearance and layout with image outpainting.
Lastly, we explore two ways of utilizing PanoGen in VLN pre-training and
fine-tuning. We generate instructions for paths in our PanoGen environments
with a speaker built on a pre-trained vision-and-language model for VLN
pre-training, and augment the visual observation with our panoramic
environments during agents' fine-tuning to avoid overfitting to seen
environments. Empirically, learning with our PanoGen environments achieves the
new state-of-the-art on the Room-to-Room, Room-for-Room, and CVDN datasets.
Pre-training with our PanoGen speaker data is especially effective for CVDN,
which has under-specified instructions and needs commonsense knowledge. Lastly,
we show that the agent can benefit from training with more generated panoramic
environments, suggesting promising results for scaling up the PanoGen
environments.
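To make the recursive outpainting step concrete, here is a minimal sketch, assuming the Hugging Face diffusers inpainting pipeline as a stand-in for the paper's text-to-image diffusion model; the checkpoint name, window size, overlap, and step count are illustrative assumptions, not the paper's settings, and the final 360-degree stitching/projection is omitted:

```python
# Minimal recursive-outpainting sketch (assumed setup, not the paper's code).
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"  # assumed stand-in checkpoint
).to("cuda")

def generate_panorama(caption: str, n_steps: int = 6,
                      window: int = 512, overlap: int = 128) -> Image.Image:
    # Seed view: inpaint a fully masked canvas so the first image is
    # conditioned only on the room caption (white mask = regions to generate).
    blank = Image.new("RGB", (window, window))
    full_mask = Image.new("L", (window, window), 255)
    pano = pipe(prompt=caption, image=blank, mask_image=full_mask).images[0]

    for _ in range(n_steps):
        # Keep an `overlap`-pixel strip from the current right edge as known
        # context, mask the rest, and let the model extend the scene.
        context = pano.crop((pano.width - overlap, 0, pano.width, window))
        canvas = Image.new("RGB", (window, window))
        canvas.paste(context, (0, 0))
        mask = Image.new("L", (window, window), 255)
        mask.paste(0, (0, 0, overlap, window))  # black = keep the context strip
        view = pipe(prompt=caption, image=canvas, mask_image=mask).images[0]
        # Stitch only the newly generated part onto the growing panorama.
        strip = view.crop((overlap, 0, window, window))
        grown = Image.new("RGB", (pano.width + window - overlap, window))
        grown.paste(pano, (0, 0))
        grown.paste(strip, (pano.width, 0))
        pano = grown
    return pano

# Example: a room caption like those obtained by captioning Matterport3D views.
pano = generate_panorama("a cozy living room with a stone fireplace and wood floors")
```

Reusing an overlap strip as known context at each step is what keeps adjacent views consistent: the model only ever extends an image it can see, rather than generating disconnected views.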
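The fine-tuning-time augmentation can likewise be read as stochastic replacement of observed panoramas with generated ones. A hypothetical sketch, where generated_panos, viewpoint_id, and the replacement probability p are illustrative names and values rather than the paper's:

```python
import random
from typing import Dict
from PIL import Image

def augmented_observation(viewpoint_id: str,
                          original_pano: Image.Image,
                          generated_panos: Dict[str, Image.Image],
                          p: float = 0.5) -> Image.Image:
    # With probability p, show the agent a generated panorama for this
    # viewpoint instead of the original Matterport3D one, so the agent
    # cannot overfit to the appearance of the seen training environments.
    if viewpoint_id in generated_panos and random.random() < p:
        return generated_panos[viewpoint_id]
    return original_pano
```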
Related papers
- DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion
We propose a novel text-driven panoramic generation framework, DiffPano, to achieve scalable, consistent, and diverse panoramic scene generation.
We show that DiffPano can generate consistent, diverse panoramic images given unseen text descriptions and camera poses.
arXiv Detail & Related papers (2024-10-31T17:57:02Z)
- Bird's-Eye-View Scene Graph for Vision-Language Navigation
Vision-language navigation (VLN) requires an agent to navigate 3D environments by following human instructions.
We present a BEV Scene Graph (BSG), which leverages multi-step BEV representations to encode scene layouts and geometric cues of indoor environments.
Based on BSG, the agent predicts a local BEV grid-level decision score and a global graph-level decision score, combined with a sub-view selection score on panoramic views.
arXiv Detail & Related papers (2023-08-09T07:48:20Z)
- PanoViT: Vision Transformer for Room Layout Estimation from a Single Panoramic Image
PanoViT is a panorama vision transformer that estimates the room layout from a single panoramic image.
Compared to CNN models, PanoViT is more proficient at learning global information from the panoramic image.
Our method outperforms state-of-the-art solutions in room layout prediction accuracy.
arXiv Detail & Related papers (2022-12-23T05:37:11Z)
- Panoramic Panoptic Segmentation: Insights Into Surrounding Parsing for Mobile Agents via Unsupervised Contrastive Learning
We introduce panoramic panoptic segmentation as the most holistic form of scene understanding.
A complete understanding of the surroundings provides a mobile agent with maximal information.
We propose a framework that allows model training on standard pinhole images and transfers the learned features to a different domain.
arXiv Detail & Related papers (2022-06-21T20:07:15Z)
- EnvEdit: Environment Editing for Vision-and-Language Navigation
In Vision-and-Language Navigation (VLN), an agent needs to navigate through the environment based on natural language instructions.
We propose EnvEdit, a data augmentation method that creates new environments by editing existing environments.
We show that our proposed EnvEdit method achieves significant improvements on all metrics for both pre-trained and non-pre-trained VLN agents.
arXiv Detail & Related papers (2022-03-29T15:44:32Z)
- DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context Graph and Relation-based Optimization
We propose a novel method for panoramic 3D scene understanding which recovers the 3D room layout and the shape, pose, position, and semantic category for each object from a single full-view panorama image.
Experiments demonstrate that our method outperforms existing methods on panoramic scene understanding in terms of both geometry accuracy and object arrangement.
arXiv Detail & Related papers (2021-08-24T13:55:29Z)