Multimodal Transformer with Variable-length Memory for
Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2111.05759v1
- Date: Wed, 10 Nov 2021 16:04:49 GMT
- Title: Multimodal Transformer with Variable-length Memory for
Vision-and-Language Navigation
- Authors: Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari,
Zehuan Yuan
- Abstract summary: Vision-and-Language Navigation (VLN) is a task in which an agent is required to follow a language instruction to navigate to a goal position.
Recent Transformer-based VLN methods have made great progress benefiting from the direct connections between visual observations and the language instruction.
We introduce Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation.
- Score: 79.1669476932147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Navigation (VLN) is a task in which an agent is
required to follow a language instruction to navigate to a goal position, relying
on ongoing interactions with the environment as it moves. Recent
Transformer-based VLN methods have made great progress benefiting from the
direct connections between visual observations and the language instruction via
the multimodal cross-attention mechanism. However, these methods usually
represent temporal context as a fixed-length vector by using an LSTM decoder or
using manually designed hidden states to build a recurrent Transformer.
Considering a single fixed-length vector is often insufficient to capture
long-term temporal context, in this paper, we introduce Multimodal Transformer
with Variable-length Memory (MTVM) for visually-grounded natural language
navigation by modelling the temporal context explicitly. Specifically, MTVM
enables the agent to keep track of the navigation trajectory by directly
storing previous activations in a memory bank. To further boost the
performance, we propose a memory-aware consistency loss to help learn a better
joint representation of temporal context with random masked instructions. We
evaluate MTVM on the popular R2R and CVDN datasets; our model improves Success
Rate on the R2R unseen validation and test sets by 2% each, and improves Goal
Progress by 1.6m on the CVDN test set.
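The abstract describes the memory mechanism only at a high level. Below is a minimal PyTorch-style sketch of how a variable-length memory bank could feed a multimodal cross-attention step: each navigation step's pooled activation is appended to the memory, and the next decision attends over instruction, memory, and current view tokens. All class names, dimensions, pooling choices, and the action head are illustrative assumptions, not the authors' released implementation; the memory-aware consistency loss with masked instructions is omitted here.

```python
# Sketch of a variable-length memory for VLN-style decision making.
# Names (VariableLengthMemory, MTVMStep) and hyperparameters are assumptions
# made for illustration only; this is not the paper's official code.
import torch
import torch.nn as nn


class VariableLengthMemory:
    """Stores one activation per past step; grows with the trajectory length."""

    def __init__(self):
        self.slots = []  # list of (batch, d) tensors, one per previous step

    def append(self, step_activation: torch.Tensor) -> None:
        # Detaching is a simplification for this sketch.
        self.slots.append(step_activation.detach())

    def tokens(self, batch: int, d: int, device) -> torch.Tensor:
        if not self.slots:
            return torch.zeros(batch, 0, d, device=device)
        return torch.stack(self.slots, dim=1)  # (batch, T_mem, d)


class MTVMStep(nn.Module):
    """One decision step: cross-modal attention over instruction, memory, view."""

    def __init__(self, d_model: int = 768, n_heads: int = 12,
                 n_layers: int = 4, n_actions: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, instr_tokens, memory_tokens, view_tokens):
        b = instr_tokens.size(0)
        seq = torch.cat([self.cls.expand(b, -1, -1),
                         instr_tokens, memory_tokens, view_tokens], dim=1)
        fused = self.encoder(seq)
        step_state = fused[:, 0]              # pooled state for this step
        return self.action_head(step_state), step_state


# Toy rollout: the memory length equals the number of steps taken so far,
# rather than compressing the whole history into one fixed-length vector.
if __name__ == "__main__":
    d, b = 768, 1
    model, memory = MTVMStep(d_model=d), VariableLengthMemory()
    instr = torch.randn(b, 20, d)             # encoded instruction tokens
    for _ in range(3):                        # three navigation steps
        view = torch.randn(b, 36, d)          # panoramic view features
        mem = memory.tokens(b, d, instr.device)
        logits, state = model(instr, mem, view)
        memory.append(state)                  # memory grows by one slot per step
```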
Related papers
- Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation [28.16053631036079]
Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to locate an arbitrary number of target objects, referred to by a language expression, in a video.
We introduce a compact Transformer-based method, termed TenRMOT, to exploit the advantages of Transformer architecture.
TenRMOT demonstrates superior performance on both the referring multi-object tracking and the segmentation tasks.
arXiv Detail & Related papers (2024-10-17T11:07:05Z) - Bidirectional Correlation-Driven Inter-Frame Interaction Transformer for
Referring Video Object Segmentation [44.952526831843386]
We propose a correlation-driven inter-frame interaction Transformer, dubbed BIFIT, to address these issues in RVOS.
Specifically, we design a lightweight plug-and-play inter-frame interaction module in the decoder.
A vision-language interaction is implemented before the Transformer to facilitate the correlation between the visual and linguistic features.
arXiv Detail & Related papers (2023-07-02T10:29:35Z) - Referred by Multi-Modality: A Unified Temporal Transformer for Video
Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation.
With a unified framework, MUTR for the first time adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z) - MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval [60.454321238910474]
State-of-the-art video-text retrieval methods typically involve fully fine-tuning a pre-trained model on specific datasets.
We present our pioneering work that enables parameter-efficient VTR using a pre-trained model.
We propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text.
arXiv Detail & Related papers (2023-01-19T03:42:56Z) - Reinforced Structured State-Evolution for Vision-Language Navigation [42.46176089721314]
The Vision-and-Language Navigation (VLN) task requires an embodied agent to navigate to a remote location following a natural language instruction.
Previous methods usually adopt a sequence model (e.g., Transformer and LSTM) as the navigator.
We propose a novel Structured state-Evolution (SEvol) model to effectively maintain the environment layout clues for VLN.
arXiv Detail & Related papers (2022-04-20T07:51:20Z) - Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z) - History Aware Multimodal Transformer for Vision-and-Language Navigation [96.80655332881432]
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes.
We introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making.
arXiv Detail & Related papers (2021-10-25T22:54:41Z) - Learning to Combine the Modalities of Language and Video for Temporal
Moment Localization [4.203274985072923]
Temporal moment localization aims to retrieve the best video segment matching a moment specified by a query.
We introduce a novel recurrent unit, cross-modal long short-term memory (CM-LSTM), by mimicking the human cognitive process of localizing temporal moments.
We also devise a two-stream attention mechanism over video features both attended and unattended by the input query, to prevent necessary visual information from being neglected.
arXiv Detail & Related papers (2021-09-07T08:25:45Z) - Dynamic Context-guided Capsule Network for Multimodal Machine
Translation [131.37130887834667]
Multimodal machine translation (MMT) mainly focuses on enhancing text-only translation with visual features.
We propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT.
Experimental results on the Multi30K dataset of English-to-German and English-to-French translation demonstrate the superiority of DCCN.
arXiv Detail & Related papers (2020-09-04T06:18:24Z)