Local Slot Attention for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2206.08645v1
- Date: Fri, 17 Jun 2022 09:21:26 GMT
- Title: Local Slot Attention for Vision-and-Language Navigation
- Authors: Yifeng Zhuang, Qiang Sun, Yanwei Fu, Lifeng Chen, Xiangyang Xue
- Abstract summary: Vision-and-language navigation (VLN) is a hot topic in the computer vision and natural language processing community.
We propose a slot-attention based module to incorporate information from segmentation of the same object.
Experiments on the R2R dataset show that our model achieves state-of-the-art results.
- Score: 30.705802302315785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-and-language navigation (VLN), a frontier study aiming to pave the way
for general-purpose robots, has been a hot topic in the computer vision and
natural language processing community. The VLN task requires an agent to
navigate to a goal location following natural language instructions in
unfamiliar environments.
Recently, transformer-based models have brought significant improvements to
the VLN task, since the attention mechanism in the transformer architecture can
better integrate inter- and intra-modal information from vision and language.
However, there exist two problems in current transformer-based models.
1) The models process each view independently without taking the integrity of
the objects into account.
2) During the self-attention operation in the visual modality, views that are
spatially distant can attend to and mix with each other without explicit
restriction. This kind of mixing may introduce extra noise instead of useful
information.
To address these issues, we propose 1) a slot-attention based module that
incorporates information from segments of the same object, and 2) a local
attention mask mechanism that limits the visual attention span. The proposed
modules can be easily plugged into any VLN architecture, and we use
Recurrent VLN-BERT as our base model. Experiments on the R2R dataset show that
our model achieves state-of-the-art results.
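
The abstract does not give implementation details, so the following is only a minimal sketch of what the two proposed components could look like in PyTorch: a slot-attention module that iteratively pools patch features so that patches belonging to the same object segment end up aggregated in one slot, and a local attention mask that restricts visual self-attention to spatially nearby views. All names, tensor shapes, the number of slots and iterations, and the heading-based neighbourhood criterion are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class SlotAttentionPool(nn.Module):
    """Iteratively bind a fixed set of slots to patch features so that
    patches from the same object segment are aggregated into one slot."""

    def __init__(self, dim: int, num_slots: int = 8, iters: int = 3):
        super().__init__()
        self.iters = iters
        self.scale = dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D) patch features of one candidate view
        B, N, D = feats.shape
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_init.unsqueeze(0).expand(B, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(slots)                              # (B, S, D)
            logits = q @ k.transpose(1, 2) * self.scale       # (B, S, N)
            attn = logits.softmax(dim=1)                      # slots compete for patches
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
            updates = attn @ v                                # (B, S, D)
            slots = self.gru(updates.reshape(-1, D),
                             slots.reshape(-1, D)).view(B, -1, D)
        return slots                                          # (B, S, D) object-level features


def local_attention_mask(headings: torch.Tensor, max_angle: float) -> torch.Tensor:
    """Boolean (V, V) mask letting each view attend only to views whose
    heading difference is at most `max_angle` radians (assumed criterion)."""
    diff = (headings[:, None] - headings[None, :]).abs()
    diff = torch.minimum(diff, 2 * torch.pi - diff)  # wrap around the panorama
    return diff <= max_angle                         # True = attention allowed
```

In this sketch a True entry marks an allowed view pair; if the mask were passed to nn.MultiheadAttention as attn_mask it would need to be inverted, since there a True entry blocks attention.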
Related papers
- Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust [9.647148940880381]
Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies.
We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that dynamically identifies regions of the input image that the model is sensitive to.
We show that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of distractor objects and backgrounds.
arXiv Detail & Related papers (2024-10-02T19:29:24Z)
- LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models [50.259006481656094]
We present a novel interactive application aimed at understanding the internal mechanisms of large vision-language models.
Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer.
We present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
arXiv Detail & Related papers (2024-04-03T23:57:34Z)
- GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
arXiv Detail & Related papers (2023-05-26T17:15:22Z)
- A Dual Semantic-Aware Recurrent Global-Adaptive Network For Vision-and-Language Navigation [3.809880620207714]
Vision-and-Language Navigation (VLN) is a realistic but challenging task that requires an agent to locate the target region using verbal and visual cues.
This work proposes a dual semantic-aware recurrent global-adaptive network (DSRG) to address the above problems.
arXiv Detail & Related papers (2023-05-05T15:06:08Z)
- SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans) to incorporate object structure information into transformer for enhancing discriminative representation learning.
The proposed two modules are light-weighted and can be plugged into any transformer network and trained end-to-end easily.
Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
- Transferring ConvNet Features from Passive to Active Robot Self-Localization: The Use of Ego-Centric and World-Centric Views [2.362412515574206]
A standard VPR subsystem is assumed to be available, and we propose transferring its domain-invariant state recognition ability to train a domain-invariant NBV planner.
We divide the visual cues available from the CNN model into two types: the output layer cue (OLC) and the intermediate layer cue (ILC).
In our framework, the ILC and OLC are mapped to a state vector and subsequently used to train a multiview NBV planner via deep reinforcement learning.
arXiv Detail & Related papers (2022-04-22T04:42:33Z)
- VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers [47.581265194864585]
Internal mechanisms of vision and multimodal transformers remain largely opaque.
With the success of these transformers, it is increasingly critical to understand their inner workings.
We propose VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers.
arXiv Detail & Related papers (2022-03-30T05:25:35Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
- A Recurrent Vision-and-Language BERT for Navigation [54.059606864535304]
We propose a recurrent BERT model that is time-aware for use in vision-and-language navigation.
Our model can replace more complex encoder-decoder models to achieve state-of-the-art results.
arXiv Detail & Related papers (2020-11-26T00:23:00Z)