Scaling Data Generation in Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2307.15644v2
- Date: Wed, 9 Aug 2023 23:51:22 GMT
- Title: Scaling Data Generation in Vision-and-Language Navigation
- Authors: Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen
Gould, Hao Tan, Yu Qiao
- Abstract summary: We propose an effective paradigm for generating large-scale data for learning.
We apply 1200+ photo-realistic environments from HM3D and Gibson datasets and synthesizes 4.9 million instruction trajectory pairs.
Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute with regard to previous SoTA) to a significantly new best of 80% single-run success rate on the R2R test split by simple imitation learning.
- Score: 116.95534559103788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research in language-guided visual navigation has demonstrated a
significant demand for the diversity of traversable environments and the
quantity of supervision for training generalizable agents. To tackle the common
data scarcity issue in existing vision-and-language navigation datasets, we
propose an effective paradigm for generating large-scale data for learning,
which applies 1200+ photo-realistic environments from HM3D and Gibson datasets
and synthesizes 4.9 million instruction trajectory pairs using fully-accessible
resources on the web. Importantly, we investigate the influence of each
component in this paradigm on the agent's performance and study how to
adequately apply the augmented data to pre-train and fine-tune an agent. Thanks
to our large-scale dataset, the performance of an existing agent can be pushed
up (+11% absolute with regard to previous SoTA) to a significantly new best of
80% single-run success rate on the R2R test split by simple imitation learning.
The long-lasting generalization gap between navigating in seen and unseen
environments is also reduced to less than 1% (versus 8% in the previous best
method). Moreover, our paradigm also facilitates different models to achieve
new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous
environments.
Related papers
- UdeerLID+: Integrating LiDAR, Image, and Relative Depth with Semi-Supervised [12.440461420762265]
Road segmentation is a critical task for autonomous driving systems.
Our work introduces an innovative approach that integrates LiDAR point cloud data, visual image, and relative depth maps.
One of the primary challenges is the scarcity of large-scale, accurately labeled datasets.
arXiv Detail & Related papers (2024-09-10T03:57:30Z) - CLIPping the Deception: Adapting Vision-Language Models for Universal
Deepfake Detection [3.849401956130233]
We explore the effectiveness of pre-trained vision-language models (VLMs) when paired with recent adaptation methods for universal deepfake detection.
We employ only a single dataset (ProGAN) in order to adapt CLIP for deepfake detection.
The simple and lightweight Prompt Tuning based adaptation strategy outperforms the previous SOTA approach by 5.01% mAP and 6.61% accuracy.
arXiv Detail & Related papers (2024-02-20T11:26:42Z) - Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning [50.809769498312434]
We propose a novel dataset pruning method termed as Temporal Dual-Depth Scoring (TDDS)
Our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.
arXiv Detail & Related papers (2023-11-22T03:45:30Z) - Replication: Contrastive Learning and Data Augmentation in Traffic
Classification Using a Flowpic Input Representation [47.95762911696397]
We reproduce [16] on the same datasets and replicate its most salient aspect (the importance of data augmentation) on three additional public datasets.
While we confirm most of the original results, we also found a 20% accuracy drop on some of the investigated scenarios due to a data shift in the original dataset.
arXiv Detail & Related papers (2023-09-18T12:55:09Z) - ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP)
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z) - Masked Path Modeling for Vision-and-Language Navigation [41.7517631477082]
Vision-and-language navigation (VLN) agents are trained to navigate in real-world environments by following natural language instructions.
Previous approaches have attempted to address this issue by introducing additional supervision during training.
We introduce a masked path modeling (MPM) objective, which pretrains an agent using self-collected data for downstream navigation tasks.
arXiv Detail & Related papers (2023-05-23T17:20:20Z) - A New Path: Scaling Vision-and-Language Navigation with Synthetic
Instructions and Imitation Learning [70.14372215250535]
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments.
Given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding.
We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory.
The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets.
arXiv Detail & Related papers (2022-10-06T17:59:08Z) - Towards Scale Consistent Monocular Visual Odometry by Learning from the
Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z) - CrossLoc: Scalable Aerial Localization Assisted by Multimodal Synthetic
Data [2.554905387213586]
We present a visual localization system that learns to estimate camera poses in the real world with the help of synthetic data.
To mitigate the data scarcity issue, we introduce TOPO-DataGen, a versatile synthetic data generation tool.
We also introduce CrossLoc, a cross-modal visual representation learning approach to pose estimation.
arXiv Detail & Related papers (2021-12-16T18:05:48Z) - Vision-Language Navigation with Random Environmental Mixup [112.94609558723518]
Vision-language Navigation (VLN) tasks require an agent to navigate step-by-step while perceiving the visual observations and comprehending a natural language instruction.
Previous works have proposed various data augmentation methods to reduce data bias.
We propose the Random Environmental Mixup (REM) method, which generates cross-connected house scenes as augmented data via mixuping environment.
arXiv Detail & Related papers (2021-06-15T04:34:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.