Bridging the visual gap in VLN via semantically richer instructions
- URL: http://arxiv.org/abs/2210.15565v1
- Date: Thu, 27 Oct 2022 15:58:07 GMT
- Title: Bridging the visual gap in VLN via semantically richer instructions
- Authors: Joaquin Ossandón, Benjamin Earle, Álvaro Soto
- Abstract summary: We show that state-of-the-art models are not severely affected when they receive just limited or even no visual data.
We propose a new data augmentation method that fosters the inclusion of more explicit visual information.
- Score: 3.5789352263336847
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Visual-and-Language Navigation (VLN) task requires understanding a
textual instruction to navigate a natural indoor environment using only visual
information. While this is a trivial task for most humans, it is still an open
problem for AI models. In this work, we hypothesize that poor use of the visual
information available is at the core of the low performance of current models.
To support this hypothesis, we provide experimental evidence showing that
state-of-the-art models are not severely affected when they receive just
limited or even no visual data, indicating a strong overfitting to the textual
instructions. To encourage a more suitable use of the visual information, we
propose a new data augmentation method that fosters the inclusion of more
explicit visual information in the generation of textual navigational
instructions. Our main intuition is that current VLN datasets include textual
instructions that are intended to inform an expert navigator, such as a human,
but not a beginner visual navigational agent, such as a randomly initialized DL
model. Specifically, to bridge the visual semantic gap of current VLN datasets,
we take advantage of metadata available for the Matterport3D dataset that,
among others, includes information about object labels that are present in the
scenes. Training a state-of-the-art model with the new set of instructions
increases its performance by 8% in terms of success rate on unseen environments,
demonstrating the advantages of the proposed data augmentation method.
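The abstract describes the augmentation only at a high level. As an illustrative sketch, not the authors' actual generation pipeline, the Python snippet below shows one way object labels drawn from per-viewpoint scene metadata (such as the Matterport3D house segmentations) could be injected into an existing textual instruction to make its visual references explicit. All identifiers here (enrich_instruction, objects_by_viewpoint, the viewpoint ids) are hypothetical.
```python
# Minimal sketch (not the paper's pipeline): append explicit visual landmarks,
# taken from per-viewpoint object-label metadata, to a navigation instruction.
from typing import Dict, List


def enrich_instruction(instruction: str,
                       path_viewpoints: List[str],
                       objects_by_viewpoint: Dict[str, List[str]],
                       max_objects: int = 2) -> str:
    """Weave object labels visible along the path into the instruction text.

    objects_by_viewpoint maps a viewpoint id to object labels visible there
    (e.g. extracted from Matterport3D scene metadata).
    """
    landmark_phrases = []
    for viewpoint in path_viewpoints:
        labels = objects_by_viewpoint.get(viewpoint, [])[:max_objects]
        if labels:
            landmark_phrases.append("passing the " + " and the ".join(labels))
    if not landmark_phrases:
        return instruction
    return instruction.rstrip(". ") + ", " + ", then ".join(landmark_phrases) + "."


if __name__ == "__main__":
    # Toy example: two viewpoints along the path, each with visible objects.
    metadata = {"vp_1": ["sofa", "lamp"], "vp_2": ["dining table"]}
    base = "Walk down the hallway and stop at the second door."
    print(enrich_instruction(base, ["vp_1", "vp_2"], metadata))
    # -> Walk down the hallway and stop at the second door, passing the sofa
    #    and the lamp, then passing the dining table.
```
The enriched output names concrete objects along the route, which is the kind of explicitly grounded phrasing the paper argues a beginner visual agent, unlike an expert human navigator, actually needs.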
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- VANP: Learning Where to See for Navigation with Self-Supervised Vision-Action Pre-Training [8.479135285935113]
Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation.
Most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects.
We propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP).
arXiv Detail & Related papers (2024-03-12T22:33:08Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models [31.69213233651326]
We introduce the novel task of Visual Data-Type Identification.
An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
arXiv Detail & Related papers (2023-10-12T17:59:30Z)
- Ignorance is Bliss: Robust Control via Information Gating [60.17644038829572]
Informational parsimony provides a useful inductive bias for learning representations that achieve better generalization by being robust to noise and spurious correlations.
We propose information gating as a way to learn parsimonious representations that identify the minimal information required for a task.
arXiv Detail & Related papers (2023-03-10T18:31:50Z)
- Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense [98.70218717851665]
It is unclear whether the models really understand the visual scene and underlying commonsense knowledge due to limited evaluation data resources.
We present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge.
We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation.
arXiv Detail & Related papers (2022-11-10T21:44:33Z)
- Leveraging Unlabeled Data for Sketch-based Understanding [11.95015190261688]
We present a study about the use of unlabeled data to improve a sketch-based model.
Our results show the superiority of sketch-BYOL, which outperforms other self-supervised approaches.
arXiv Detail & Related papers (2022-04-26T18:13:30Z)
- Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better in a previously unseen environment.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)