Learning from Unlabeled 3D Environments for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2208.11781v1
- Date: Wed, 24 Aug 2022 21:50:20 GMT
- Title: Learning from Unlabeled 3D Environments for Vision-and-Language Navigation
- Authors: Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
- Abstract summary: In vision-and-language navigation (VLN), an embodied agent is required to navigate in realistic 3D environments following natural language instructions.
We propose to automatically create a large-scale VLN dataset from 900 unlabeled 3D buildings from HM3D.
We experimentally demonstrate that HM3D-AutoVLN significantly increases the generalization ability of resulting VLN models.
- Score: 87.03299519917019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In vision-and-language navigation (VLN), an embodied agent is required to
navigate in realistic 3D environments following natural language instructions.
One major bottleneck for existing VLN approaches is the lack of sufficient
training data, resulting in unsatisfactory generalization to unseen
environments. While VLN data is typically collected manually, such an approach
is expensive and prevents scalability. In this work, we address the data
scarcity issue by proposing to automatically create a large-scale VLN dataset
from 900 unlabeled 3D buildings from HM3D. We generate a navigation graph for
each building and transfer object predictions from 2D to generate pseudo 3D
object labels by cross-view consistency. We then fine-tune a pretrained
language model using pseudo object labels as prompts to alleviate the
cross-modal gap in instruction generation. Our resulting HM3D-AutoVLN dataset
is an order of magnitude larger than existing VLN datasets in terms of
navigation environments and instructions. We experimentally demonstrate that
HM3D-AutoVLN significantly increases the generalization ability of resulting
VLN models. On the SPL metric, our approach improves over state of the art by
7.1% and 8.1% on the unseen validation splits of REVERIE and SOON datasets
respectively.
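
The cross-view consistency step can be pictured as a voting scheme: a 3D object instance keeps a label predicted by the 2D detector only if enough of the views that observe it agree on that label. The sketch below is a minimal illustration of this idea rather than the authors' pipeline; the View record, the vote_labels helper, and the min_views/min_agreement thresholds are illustrative assumptions.

```python
# Minimal sketch (not the HM3D-AutoVLN pipeline): pseudo-labeling 3D object
# instances by cross-view consistency. A 3D instance keeps a 2D-predicted
# label only if a sufficient fraction of the views observing it agree.
from collections import Counter
from dataclasses import dataclass

@dataclass
class View:
    instance_id: int   # 3D instance this 2D detection projects onto
    label: str         # class predicted by the 2D detector
    score: float       # detector confidence (unused here, kept for clarity)

def vote_labels(detections, min_views=3, min_agreement=0.6):
    """Return {instance_id: label} for instances whose 2D predictions agree."""
    by_instance = {}
    for det in detections:
        by_instance.setdefault(det.instance_id, []).append(det)
    pseudo_labels = {}
    for inst_id, dets in by_instance.items():
        if len(dets) < min_views:
            continue  # too few observations to trust any label
        counts = Counter(d.label for d in dets)
        label, votes = counts.most_common(1)[0]
        if votes / len(dets) >= min_agreement:
            pseudo_labels[inst_id] = label
    return pseudo_labels

if __name__ == "__main__":
    dets = [View(0, "chair", 0.9), View(0, "chair", 0.8), View(0, "sofa", 0.4),
            View(1, "table", 0.7), View(1, "desk", 0.6)]
    print(vote_labels(dets, min_views=2))  # {0: 'chair'}; instance 1 is rejected
```

For reference, the reported gains are on SPL (Success weighted by Path Length), which averages S_i * l_i / max(p_i, l_i) over episodes, where S_i indicates success, l_i is the shortest-path length, and p_i is the length of the path actually taken.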
Related papers
- Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning [55.339257446600634]
We introduce Robin3D, a powerful 3D large language model (3DLLM) trained on large-scale instruction-following data.
We construct 1 million instruction-following samples, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training-set samples.
Robin3D consistently outperforms previous methods across five widely-used 3D multimodal learning benchmarks.
arXiv Detail & Related papers (2024-09-30T21:55:38Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments consistently demonstrates our method's superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations, MMScan.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
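
The VER entry above rests on voxelizing the physical world into structured 3D cells. The snippet below is a toy sketch of that voxelization step, not the paper's implementation; the voxelize_occupancy helper and its parameters are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the VER code): binning a point cloud of
# the environment into a boolean 3D occupancy grid.
import numpy as np

def voxelize_occupancy(points, origin, voxel_size, grid_shape):
    """points: (N, 3) world coordinates -> boolean occupancy grid."""
    occ = np.zeros(grid_shape, dtype=bool)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    # Keep only points that fall inside the grid bounds.
    valid = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx = idx[valid]
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occ

# Usage: a 2m x 2m x 2m region at 0.1m resolution.
pts = np.random.rand(5000, 3) * 2.0
grid = voxelize_occupancy(pts, origin=np.zeros(3), voxel_size=0.1,
                          grid_shape=(20, 20, 20))
print(grid.sum(), "of", grid.size, "cells occupied")
```

Heads such as occupancy, room layout, and bounding boxes can then be predicted on top of this structured grid.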
- UniG3D: A Unified 3D Object Generation Dataset [75.49544172927749]
UniG3D is a unified 3D object generation dataset constructed by employing a universal data transformation pipeline on ShapeNet datasets.
This pipeline converts each raw 3D model into comprehensive multi-modal data representation.
The selection of data sources for our dataset is based on their scale and quality.
arXiv Detail & Related papers (2023-06-19T07:03:45Z)
- Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding [40.68012530554327]
We introduce a pretrained 3D backbone, called SST, for 3D indoor scene understanding.
We design a 3D Swin transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity.
A series of extensive ablation studies further validates the scalability, generality, and superior performance enabled by our approach.
arXiv Detail & Related papers (2023-04-14T02:49:08Z)
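
The Swin3D/SST summary above attributes the linear memory cost to restricting self-attention to sparse voxels within local windows. The sketch below illustrates that window-partitioned attention idea under my own simplifying assumptions; the window_attention helper and the window size are not from the paper.

```python
# Minimal sketch (illustrative, not the Swin3D/SST code): self-attention
# restricted to fixed-size spatial windows over occupied voxels, so memory
# grows with the number of voxels rather than quadratically in all of them.
import numpy as np

def window_attention(coords, feats, window=4):
    """coords: (N, 3) int voxel coordinates; feats: (N, C) features."""
    out = np.zeros_like(feats)
    groups = {}
    for i, c in enumerate(coords):
        groups.setdefault(tuple(c // window), []).append(i)  # assign to a window
    for idx in groups.values():
        x = feats[idx]                           # voxels sharing one window
        attn = x @ x.T / np.sqrt(x.shape[1])     # scaled dot-product scores
        attn = np.exp(attn - attn.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)  # softmax within the window
        out[idx] = attn @ x                      # attention-weighted mixing
    return out

# Usage: 1000 occupied voxels in a 32^3 grid with 16-dim features.
rng = np.random.default_rng(0)
coords = rng.integers(0, 32, size=(1000, 3))
feats = rng.standard_normal((1000, 16)).astype(np.float32)
print(window_attention(coords, feats).shape)  # (1000, 16)
```

Because attention is computed only among voxels in the same window, each window's cost is bounded and the total scales roughly linearly with the number of occupied voxels.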
- Exploring Deep 3D Spatial Encodings for Large-Scale 3D Scene Understanding [19.134536179555102]
We propose an alternative approach that overcomes the limitations of CNN-based approaches by encoding the spatial features of raw 3D point clouds into undirected graph models.
The proposed method achieves accuracy on par with the state of the art, with improved training time and model stability, indicating strong potential for further research.
arXiv Detail & Related papers (2020-11-29T12:56:19Z)
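
The graph-based entry above encodes raw point clouds as undirected graphs. A common construction, shown below purely as an illustrative sketch (the paper may build its graphs differently), connects each point to its k nearest neighbours; knn_graph and the choice of k are assumptions.

```python
# Minimal sketch (illustrative): build an undirected k-nearest-neighbour graph
# over a raw 3D point cloud; graph layers can then operate on these edges.
import numpy as np

def knn_graph(points, k=8):
    """points: (N, 3) -> set of undirected edges (i, j) with i < j."""
    diff = points[:, None, :] - points[None, :, :]   # pairwise offsets, (N, N, 3)
    dist = np.linalg.norm(diff, axis=-1)             # pairwise distances, (N, N)
    np.fill_diagonal(dist, np.inf)                   # exclude self-loops
    nbrs = np.argsort(dist, axis=1)[:, :k]           # k nearest per point
    return {tuple(sorted((i, int(j))))
            for i in range(len(points)) for j in nbrs[i]}

pts = np.random.rand(100, 3)
print(len(knn_graph(pts, k=8)), "undirected edges")
```

The brute-force distance matrix is O(N^2) and only meant to keep the sketch short; a KD-tree would be the practical choice for large scenes.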
This list is automatically generated from the titles and abstracts of the papers on this site.