PASTS: Progress-Aware Spatio-Temporal Transformer Speaker For
Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2305.11918v1
- Date: Fri, 19 May 2023 02:25:56 GMT
- Title: PASTS: Progress-Aware Spatio-Temporal Transformer Speaker For
Vision-and-Language Navigation
- Authors: Liuyi Wang, Chengju Liu, Zongtao He, Shu Li, Qingqing Yan, Huiyi Chen,
Qijun Chen
- Abstract summary: Vision-and-language navigation (VLN) is a crucial but challenging cross-modal navigation task.
One powerful technique to enhance the performance in VLN is the use of an independent speaker model to provide pseudo instructions for data augmentation.
We propose a novel progress-aware spatio-temporal transformer speaker (PASTS) model that uses the transformer as the core of the network.
- Score: 6.11362142120604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-language navigation (VLN) is a crucial but challenging cross-modal
navigation task. One powerful technique to enhance the generalization
performance in VLN is the use of an independent speaker model to provide pseudo
instructions for data augmentation. However, current speaker models based on
Long Short-Term Memory (LSTM) lack the ability to attend to features relevant
at different locations and time steps. To address this, we propose a novel
progress-aware spatio-temporal transformer speaker (PASTS) model that uses the
transformer as the core of the network. PASTS uses a spatio-temporal encoder to
fuse panoramic representations and encode intermediate connections through
steps. In addition, to avoid the misalignment problem that could result in
incorrect supervision, a speaker progress monitor (SPM) is proposed to enable
the model to estimate the progress of instruction generation and to produce
finer-grained captions. A multifeature dropout (MFD) strategy is also
introduced to alleviate overfitting. The proposed PASTS can be flexibly
combined with existing VLN models. The experimental results demonstrate
that PASTS outperforms all existing speaker models and successfully improves
the performance of previous VLN models, achieving state-of-the-art performance
on the standard Room-to-Room (R2R) dataset.
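The abstract names three components (a spatio-temporal transformer encoder, the SPM progress head, and MFD) without layer-level detail, so the sketch below is only one plausible way to wire them together in PyTorch; the module sizes, the feature-fusion scheme, and the progress-regression head are assumptions rather than the authors' implementation.

    # Minimal sketch of a progress-aware transformer speaker (illustrative only;
    # layer sizes, fusion, and the progress head are assumptions, not PASTS itself).
    import torch
    import torch.nn as nn

    class TransformerSpeakerSketch(nn.Module):
        def __init__(self, feat_dim=2048, d_model=512, vocab_size=1000,
                     n_heads=8, n_layers=4, dropout=0.3):
            super().__init__()
            self.visual_proj = nn.Linear(feat_dim, d_model)
            # One reading of "multi-feature dropout": drop each input feature
            # stream independently so the decoder cannot over-rely on any single one.
            self.feat_drop = nn.ModuleList(nn.Dropout(dropout) for _ in range(2))
            enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.spatio_temporal_encoder = nn.TransformerEncoder(enc, n_layers)
            dec = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(dec, n_layers)
            self.word_emb = nn.Embedding(vocab_size, d_model)
            self.word_head = nn.Linear(d_model, vocab_size)  # next-word logits
            self.progress_head = nn.Linear(d_model, 1)       # speaker progress monitor

        def forward(self, pano_feats, act_feats, words):
            # pano_feats, act_feats: (B, T, feat_dim); words: (B, L) token ids.
            # Positional encodings are omitted for brevity.
            fused = self.feat_drop[0](pano_feats) + self.feat_drop[1](act_feats)
            memory = self.spatio_temporal_encoder(self.visual_proj(fused))
            tgt = self.word_emb(words)
            L = tgt.size(1)
            causal = torch.triu(torch.full((L, L), float("-inf"),
                                           device=tgt.device), diagonal=1)
            hidden = self.decoder(tgt, memory, tgt_mask=causal)
            # Word logits for instruction generation plus a per-token progress
            # estimate in [0, 1], supervisable against the normalised position.
            return self.word_head(hidden), torch.sigmoid(self.progress_head(hidden))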
Related papers
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average of 0.77% accuracy on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment.
By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z)
- Boosting Vision-Language Models with Transduction [12.281505126587048]
We present TransCLIP, a novel and computationally efficient transductive approach for vision-language models.
TransCLIP is applicable as a plug-and-play module on top of popular inductive zero- and few-shot models.
arXiv Detail & Related papers (2024-06-03T23:09:30Z)
- Vision-and-Language Navigation Generative Pretrained Transformer [0.0]
Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT) adopts a transformer decoder model (GPT-2) to model trajectory sequence dependencies, bypassing the need for historical encoding modules.
Performance assessments on the VLN dataset reveal that VLN-GPT surpasses complex state-of-the-art encoder-based models.
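The one-line description above only names the mechanism, so the following is a hedged illustration of the general idea rather than the VLN-GPT code: a GPT-2 decoder from the Hugging Face transformers package consumes projected per-step observation features via inputs_embeds, letting its causal self-attention carry the trajectory history; the feature size, layer count, and action head are assumptions.

    # Illustrative only: a GPT-2 decoder over a trajectory of observation features,
    # so causal self-attention replaces a separate history-encoding module.
    import torch
    import torch.nn as nn
    from transformers import GPT2Config, GPT2Model

    class TrajectoryGPTSketch(nn.Module):
        def __init__(self, obs_dim=768, n_actions=6):
            super().__init__()
            config = GPT2Config(n_layer=4, n_head=8, n_embd=768)
            self.gpt2 = GPT2Model(config)                  # decoder-only backbone
            self.obs_proj = nn.Linear(obs_dim, config.n_embd)
            self.action_head = nn.Linear(config.n_embd, n_actions)

        def forward(self, obs_feats):
            # obs_feats: (B, T, obs_dim) per-step observation features.
            hidden = self.gpt2(inputs_embeds=self.obs_proj(obs_feats)).last_hidden_state
            return self.action_head(hidden)                # (B, T, n_actions) logits

    # Example: action_logits = TrajectoryGPTSketch()(torch.randn(2, 8, 768))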
arXiv Detail & Related papers (2024-05-27T09:42:04Z)
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
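The summary above gives no routing or expansion details, so the sketch below only illustrates the generic mixture-of-experts adapter pattern it refers to: small bottleneck experts added on top of frozen backbone features with a learned token-wise router; the expert count, bottleneck width, and residual placement are assumptions, not the paper's design.

    # Generic MoE adapter block over frozen backbone features (illustrative only).
    import torch
    import torch.nn as nn

    class MoEAdapterSketch(nn.Module):
        def __init__(self, d_model=512, bottleneck=64, n_experts=4):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, bottleneck), nn.GELU(),
                              nn.Linear(bottleneck, d_model))
                for _ in range(n_experts)
            )
            self.router = nn.Linear(d_model, n_experts)    # token-wise routing weights

        def forward(self, x):                              # x: (B, L, d_model)
            gates = torch.softmax(self.router(x), dim=-1)  # (B, L, n_experts)
            expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, L, d, E)
            adapted = (expert_out * gates.unsqueeze(2)).sum(dim=-1)
            return x + adapted                             # residual around the frozen features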
arXiv Detail & Related papers (2024-03-18T08:00:23Z)
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding aggregate multi-scale features from an utterance with multi-branch network architectures.
We propose an effective temporal multi-scale (TMS) model in which multi-scale branches can be designed efficiently in a speaker embedding network, almost without increasing computational costs.
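The summary does not describe how the branches are built, so the sketch below is only a generic way to get multi-scale temporal receptive fields cheaply (parallel depthwise convolutions with different dilations over frame-level features); it should not be read as the TMS architecture itself.

    # Generic multi-scale temporal block over frame-level speaker features
    # (illustrative only; not the TMS design). Depthwise branches keep cost low.
    import torch
    import torch.nn as nn

    class MultiScaleTemporalSketch(nn.Module):
        def __init__(self, channels=512, dilations=(1, 2, 4)):
            super().__init__()
            self.branches = nn.ModuleList(
                nn.Conv1d(channels, channels, kernel_size=3, padding=d,
                          dilation=d, groups=channels)     # depthwise: cheap per branch
                for d in dilations
            )
            self.merge = nn.Conv1d(channels, channels, kernel_size=1)

        def forward(self, x):                              # x: (B, C, T) frame features
            out = sum(branch(x) for branch in self.branches) / len(self.branches)
            return self.merge(out) + x                     # pointwise merge + residual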
arXiv Detail & Related papers (2022-03-17T05:49:35Z)
- Temporal Transformer Networks with Self-Supervision for Action Recognition [13.00827959393591]
We introduce a novel Temporal Transformer Network with Self-supervision (TTSN).
TTSN consists of a temporal transformer module and a temporal sequence self-supervision module.
Our proposed TTSN is promising as it successfully achieves state-of-the-art performance for action recognition.
arXiv Detail & Related papers (2021-12-14T12:53:53Z)
- A Recurrent Vision-and-Language BERT for Navigation [54.059606864535304]
We propose a recurrent BERT model that is time-aware for use in vision-and-language navigation.
Our model can replace more complex encoder-decoder models to achieve state-of-the-art results.
arXiv Detail & Related papers (2020-11-26T00:23:00Z)