Vision-Language Navigation with Random Environmental Mixup
- URL: http://arxiv.org/abs/2106.07876v1
- Date: Tue, 15 Jun 2021 04:34:26 GMT
- Title: Vision-Language Navigation with Random Environmental Mixup
- Authors: Chong Liu and Fengda Zhu and Xiaojun Chang and Xiaodan Liang and
Yi-Dong Shen
- Abstract summary: Vision-language Navigation (VLN) tasks require an agent to navigate step-by-step while perceiving the visual observations and comprehending a natural language instruction.
Previous works have proposed various data augmentation methods to reduce data bias.
We propose the Random Environmental Mixup (REM) method, which generates cross-connected house scenes as augmented data by mixing up environments.
- Score: 112.94609558723518
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-language Navigation (VLN) tasks require an agent to navigate
step-by-step while perceiving the visual observations and comprehending a
natural language instruction. Large data bias, caused by the disparity between
the small data scale and the large navigation space, makes the VLN
task challenging. Previous works have proposed various data augmentation
methods to reduce data bias. However, these works do not explicitly reduce the
data bias across different house scenes. As a result, the agent overfits to
the seen scenes and navigates poorly in unseen scenes.
To tackle this problem, we propose the Random Environmental Mixup (REM) method,
which generates cross-connected house scenes as augmented data by mixing up
environments. Specifically, we first select key viewpoints according to the room
connection graph for each scene. Then, we cross-connect the key views of
different scenes to construct augmented scenes. Finally, we generate augmented
instruction-path pairs in the cross-connected scenes. The experimental results
on benchmark datasets demonstrate that the augmented data generated by REM
helps the agent reduce the performance gap between seen and unseen environments
and improves overall performance, making our model the best existing approach on
the standard VLN benchmark.
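To make the three steps in the abstract concrete, here is a minimal, self-contained sketch of the mixup idea: each scene is an adjacency list over viewpoints, the best-connected viewpoints stand in for the paper's key viewpoints on the room connection graph, and two scenes are spliced by a new edge between their key viewpoints. All names, the degree-based selection heuristic, and the random-walk stand-in for instruction-path generation are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the REM idea: splice two house scenes together at
# "key viewpoints" and sample paths that can cross the splice.
import random
from dataclasses import dataclass, field

@dataclass
class Scene:
    name: str
    graph: dict = field(default_factory=dict)  # adjacency list over viewpoint ids

def key_viewpoints(scene, k=1):
    """Degree-based stand-in for the paper's key viewpoints on the room
    connection graph (the real selection criterion is an assumption here)."""
    return sorted(scene.graph, key=lambda v: len(scene.graph[v]), reverse=True)[:k]

def mixup_scenes(s1, s2):
    """Cross-connect two scenes at one key viewpoint each, producing an
    augmented scene; viewpoint ids are prefixed to avoid collisions."""
    merged = {f"{s.name}:{v}": [f"{s.name}:{u}" for u in nbrs]
              for s in (s1, s2) for v, nbrs in s.graph.items()}
    a = f"{s1.name}:{key_viewpoints(s1)[0]}"
    b = f"{s2.name}:{key_viewpoints(s2)[0]}"
    merged[a].append(b)  # the new cross-scene edge
    merged[b].append(a)
    return Scene(name=f"{s1.name}+{s2.name}", graph=merged)

def sample_cross_path(scene, start, length=5):
    """Random-walk placeholder for augmented path generation; in the paper
    each path would be paired with a generated instruction."""
    path = [start]
    for _ in range(length):
        path.append(random.choice(scene.graph[path[-1]]))
    return path

if __name__ == "__main__":
    random.seed(0)
    house_a = Scene("houseA", {"a": ["b"], "b": ["a", "c"], "c": ["b"]})
    house_b = Scene("houseB", {"x": ["y"], "y": ["x", "z"], "z": ["y"]})
    mixed = mixup_scenes(house_a, house_b)
    print(sample_cross_path(mixed, "houseA:b"))  # may cross into houseB
```

The random walk only demonstrates that sampled paths can travel between the two source houses; the paper additionally generates an instruction for each augmented path.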
Related papers
- DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects [84.73092715537364]
In this paper, we study a new task of navigating to diverse target objects in a large number of scene types.
We build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning.
Our agent achieves a success rate that surpasses GPT-4o by over 20%.
arXiv Detail & Related papers (2024-10-03T17:49:28Z) - Less is More: High-value Data Selection for Visual Instruction Tuning [127.38740043393527]
We propose a high-value data selection approach TIVE, to eliminate redundancy within the visual instruction data and reduce the training cost.
Using only about 15% of the data, our approach achieves average performance comparable to the full-data fine-tuned model across eight benchmarks.
arXiv Detail & Related papers (2024-03-14T16:47:25Z) - Scaling Data Generation in Vision-and-Language Navigation [116.95534559103788]
We propose an effective paradigm for generating large-scale data for learning.
We apply 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesize 4.9 million instruction-trajectory pairs.
Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute over the previous SoTA) to a new best of 80% single-run success rate on the R2R test split by simple imitation learning.
arXiv Detail & Related papers (2023-07-28T16:03:28Z) - Masked Path Modeling for Vision-and-Language Navigation [41.7517631477082]
Vision-and-language navigation (VLN) agents are trained to navigate in real-world environments by following natural language instructions.
Previous approaches have introduced additional supervision during training.
We introduce a masked path modeling (MPM) objective, which pretrains an agent using self-collected data for downstream navigation tasks.
arXiv Detail & Related papers (2023-05-23T17:20:20Z) - Ground then Navigate: Language-guided Navigation in Dynamic Scenes [13.870303451896248]
We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings.
We solve the problem by explicitly grounding the navigable regions corresponding to the textual command.
We provide extensive qualitative and quantitative empirical results to validate the efficacy of the proposed approach.
arXiv Detail & Related papers (2022-09-24T09:51:09Z) - Mitigating Representation Bias in Action Recognition: Algorithms and
Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z) - A Flow Base Bi-path Network for Cross-scene Video Crowd Understanding in
Aerial View [93.23947591795897]
In this paper, we strive to tackle the challenges and automatically understand the crowd from the visual data collected from drones.
To alleviate the background noise generated in cross-scene testing, a double-stream crowd counting model is proposed.
To tackle the crowd density estimation problem in extremely dark environments, we introduce synthetic data generated by the game Grand Theft Auto V (GTAV).
arXiv Detail & Related papers (2020-09-29T01:48:24Z) - Take the Scenic Route: Improving Generalization in Vision-and-Language
Navigation [44.019674347733506]
We investigate the popular Room-to-Room (R2R) VLN benchmark and discover that what is important is not only the amount of data you synthesize, but also how you do it.
We find that shortest path sampling, which is used by both the R2R benchmark and existing augmentation methods, encodes biases in the action space of the agent, which we dub action priors.
We then show that these action priors offer one explanation for the poor generalization of existing works (a toy sketch of this effect follows the list).
arXiv Detail & Related papers (2020-03-31T14:52:42Z)