A Recurrent Vision-and-Language BERT for Navigation
- URL: http://arxiv.org/abs/2011.13922v2
- Date: Sun, 28 Mar 2021 11:45:58 GMT
- Title: A Recurrent Vision-and-Language BERT for Navigation
- Authors: Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, Stephen
Gould
- Abstract summary: We propose a recurrent BERT model that is time-aware for use in vision-and-language navigation.
Our model can replace more complex encoder-decoder models to achieve state-of-the-art results.
- Score: 54.059606864535304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accuracy of many visiolinguistic tasks has benefited significantly from the
application of vision-and-language (V&L) BERT. However, its application for the
task of vision-and-language navigation (VLN) remains limited. One reason for
this is the difficulty adapting the BERT architecture to the partially
observable Markov decision process present in VLN, requiring history-dependent
attention and decision making. In this paper we propose a recurrent BERT model
that is time-aware for use in VLN. Specifically, we equip the BERT model with a
recurrent function that maintains cross-modal state information for the agent.
Through extensive experiments on R2R and REVERIE we demonstrate that our model
can replace more complex encoder-decoder models to achieve state-of-the-art
results. Moreover, our approach can be generalised to other transformer-based
architectures, supports pre-training, and is capable of solving navigation and
referring expression tasks simultaneously.
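To make the mechanism concrete, below is a minimal sketch of the recurrent idea described in the abstract: the cross-modal state token produced at one navigation step is fed back as input at the next step, so a single transformer can stand in for an encoder-decoder loop. The class and attribute names (RecurrentVLNBERTSketch, step_transformer, action_head) and the layer sizes are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of a time-aware recurrent BERT step for VLN (illustrative only).
# Hypothetical names and sizes; not the authors' code.
import torch
import torch.nn as nn

class RecurrentVLNBERTSketch(nn.Module):
    def __init__(self, hidden: int = 768, num_actions: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.step_transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(hidden, num_actions)

    def forward(self, state, lang_tokens, visual_tokens):
        # Concatenate [state | language | vision] and run one transformer step.
        x = torch.cat([state.unsqueeze(1), lang_tokens, visual_tokens], dim=1)
        x = self.step_transformer(x)
        new_state = x[:, 0]  # updated cross-modal state token (the recurrence)
        return new_state, self.action_head(new_state)

# Usage: the state token acts as the recurrent hidden vector across steps.
model = RecurrentVLNBERTSketch()
state = torch.zeros(1, 768)        # initial agent state (placeholder)
lang = torch.randn(1, 20, 768)     # encoded instruction tokens (placeholder)
for t in range(5):
    vis = torch.randn(1, 36, 768)  # current panoramic view features (placeholder)
    state, action_logits = model(state, lang, vis)
```

The sketch only illustrates the recurrence pattern over the cross-modal state; it does not cover the pre-training setup or the referring-expression capability mentioned in the abstract.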
Related papers
- Vision-and-Language Navigation Generative Pretrained Transformer [0.0]
Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT) adopts a transformer decoder model (GPT-2) to model trajectory sequence dependencies, bypassing the need for historical encoding modules.
Performance assessments on the VLN dataset reveal that VLN-GPT surpasses complex state-of-the-art encoder-based models.
arXiv Detail & Related papers (2024-05-27T09:42:04Z)
- OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation [96.46961207887722]
OVER-NAV aims to go over and beyond the current arts of IVLN techniques.
To fully exploit the interpreted navigation data, we introduce a structured representation, coded Omnigraph.
arXiv Detail & Related papers (2024-03-26T02:34:48Z)
- PASTS: Progress-Aware Spatio-Temporal Transformer Speaker For Vision-and-Language Navigation [6.11362142120604]
Vision-and-language navigation (VLN) is a crucial but challenging cross-modal navigation task.
One powerful technique to enhance the performance in VLN is the use of an independent speaker model to provide pseudo instructions for data augmentation.
We propose a novel progress-aware spatio-temporal transformer speaker (PASTS) model that uses a transformer as the core of the network.
arXiv Detail & Related papers (2023-05-19T02:25:56Z)
- Local Slot Attention for Vision-and-Language Navigation [30.705802302315785]
Vision-and-language navigation (VLN) is a hot topic in the computer vision and natural language processing communities.
We propose a slot-attention based module to incorporate information from segmentation of the same object.
Experiments on the R2R dataset show that our model achieves state-of-the-art results.
arXiv Detail & Related papers (2022-06-17T09:21:26Z)
- XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding [73.24847320536813]
This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders.
Our framework is inspired by the success of cross-modal encoders in visual-language tasks, while we alter the learning objective to cater to the language-heavy characteristics of NLU.
arXiv Detail & Related papers (2022-04-15T03:44:00Z)
- Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation [79.1669476932147]
Vision-and-Language Navigation (VLN) is a task in which an agent is required to follow a language instruction to navigate to the goal position.
Recent Transformer-based VLN methods have made great progress benefiting from the direct connections between visual observations and the language instruction.
We introduce Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation.
arXiv Detail & Related papers (2021-11-10T16:04:49Z)
- Updater-Extractor Architecture for Inductive World State Representations [0.0]
We propose a transformer-based Updater-Extractor architecture and a training procedure that can work with sequences of arbitrary length.
We explicitly train the model to incorporate incoming information into its world state representation.
Empirically, we investigate the model performance on three different tasks, demonstrating its promise.
arXiv Detail & Related papers (2021-04-12T14:30:11Z)
- VisBERT: Hidden-State Visualizations for Transformers [66.86452388524886]
We present VisBERT, a tool for visualizing the contextual token representations within BERT for the task of (multi-hop) Question Answering.
VisBERT enables users to get insights about the model's internal state and to explore its inference steps or potential shortcomings.
arXiv Detail & Related papers (2020-11-09T15:37:43Z)
- Soft Expert Reward Learning for Vision-and-Language Navigation [94.86954695912125]
Vision-and-Language Navigation (VLN) requires an agent to find a specified spot in an unseen environment by following natural language instructions.
We introduce a Soft Expert Reward Learning (SERL) model to overcome the reward engineering and generalisation problems of the VLN task.
arXiv Detail & Related papers (2020-07-21T14:17:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.