VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language
Navigation
- URL: http://arxiv.org/abs/2308.10172v1
- Date: Sun, 20 Aug 2023 05:55:30 GMT
- Title: VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language
Navigation
- Authors: Yanyuan Qiao, Zheng Yu, Qi Wu
- Abstract summary: We present the first study to explore PETL methods for VLN tasks and propose a VLN-specific PETL method named VLN-PETL.
VLN-PETL achieves performance comparable to or even better than full fine-tuning and outperforms other PETL methods by promising margins.
- Score: 23.22586831122625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Performance on Vision-and-Language Navigation (VLN) tasks has
improved rapidly in recent years thanks to the use of large pre-trained
vision-and-language models. However, fully fine-tuning the pre-trained model
for every downstream VLN task is becoming costly due to the considerable model
size. Parameter-Efficient Transfer Learning (PETL), a recent research hotspot,
shows great potential for efficiently tuning large pre-trained models on the
common CV and NLP tasks: it exploits most of the representation knowledge
contained in the pre-trained model while tuning only a minimal set of
parameters. However, simply applying existing PETL methods to the more
challenging VLN tasks can cause non-trivial performance degradation.
Therefore, we present the first study to explore PETL methods for VLN tasks
and propose a VLN-specific PETL method named VLN-PETL. Specifically, we design
two PETL modules: Historical Interaction Booster (HIB) and Cross-modal
Interaction Booster (CIB). We then combine these two modules with several
existing PETL methods into the integrated VLN-PETL. Extensive experimental
results on four mainstream VLN tasks (R2R, REVERIE, NDH, RxR) demonstrate the
effectiveness of the proposed VLN-PETL, which achieves performance comparable
to or even better than full fine-tuning and outperforms other PETL methods by
promising margins.
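
As a rough illustration of the PETL pattern the abstract describes, below is a minimal PyTorch sketch of a bottleneck adapter attached to a frozen cross-modal attention block. The class names, dimensions, and the `CrossmodalBoosterLayer` wrapper are illustrative assumptions, not the authors' released implementation of HIB/CIB; the point is only that the pre-trained weights stay frozen while a small inserted module is trained.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Standard PETL adapter: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping for stable tuning
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class CrossmodalBoosterLayer(nn.Module):
    """Hypothetical wrapper around one frozen cross-modal attention block.

    The frozen attention lets language tokens attend to visual tokens; only the
    lightweight adapter and its layer norm receive gradients, mirroring the
    cross-modal interaction boosting idea at a high level.
    """
    def __init__(self, hidden_dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        for p in self.cross_attn.parameters():  # pre-trained weights stay frozen
            p.requires_grad = False
        self.adapter = BottleneckAdapter(hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_tokens: torch.Tensor, vision_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(text_tokens, vision_tokens, vision_tokens)
        return self.norm(self.adapter(text_tokens + attended))


layer = CrossmodalBoosterLayer()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.1f}%)")
```

Zero-initializing the adapter's up-projection makes each boosted layer start out behaving like the frozen pre-trained layer, which is a common choice for stable parameter-efficient tuning.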
Related papers
- Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, which achieve 90% of the performance with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z)
- Unsupervised Domain Adaption Harnessing Vision-Language Pre-training [4.327763441385371]
This paper focuses on harnessing the power of Vision-Language Pre-training models in Unsupervised Domain Adaptation (UDA).
We propose a novel method called Cross-Modal Knowledge Distillation (CMKD).
Our proposed method outperforms existing techniques on standard benchmarks.
arXiv Detail & Related papers (2024-08-05T02:37:59Z)
- Concept-skill Transferability-based Data Selection for Large Vision-Language Models [56.0725292404808]
We introduce COINCIDE, an effective and scalable data selection technique for training vision-language models.
We cluster the training data using internal activations from a small model, which identifies concept-skill compositions needed by a target LVLM.
Experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines.
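The cluster-then-select idea mentioned above can be sketched in a few lines. The snippet below is an illustrative approximation rather than the COINCIDE implementation (the feature shapes, cluster count, and per-cluster sampling rule are assumptions): cluster the small model's per-example activations, then fill the selection budget evenly across clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_by_clusters(features: np.ndarray, budget: int, n_clusters: int = 16,
                       seed: int = 0) -> np.ndarray:
    """Cluster per-example activation features and sample a budget evenly across clusters."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(features)
    per_cluster = max(1, budget // n_clusters)
    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size:
            chosen.append(rng.choice(members, size=min(per_cluster, members.size), replace=False))
    return np.concatenate(chosen)[:budget]

# Toy usage: 1,000 examples with 256-dim activations from a small reference model.
feats = np.random.rand(1000, 256).astype(np.float32)
subset = select_by_clusters(feats, budget=200)
print(subset.shape)
```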
arXiv Detail & Related papers (2024-06-16T16:15:20Z)
- Scaling Vision-and-Language Navigation With Offline RL [35.624579441774685]
We introduce a new problem setup, VLN-ORL, which studies VLN using suboptimal demonstration data.
We introduce a simple and effective reward-conditioned approach that can account for dataset suboptimality for training VLN agents.
Our experiments demonstrate that the proposed reward-conditioned approach leads to significant performance improvements.
arXiv Detail & Related papers (2024-03-27T11:13:20Z)
- Continual Vision-and-Language Navigation [18.20829279972436]
Vision-and-Language Navigation (VLN) agents navigate to a destination using natural language instructions and the visual information they observe.
Existing methods for training VLN agents presuppose fixed datasets, leading to a significant limitation.
We present the Continual Vision-and-Language Navigation (CVLN) paradigm, designed to evaluate agents trained through a continual learning process.
arXiv Detail & Related papers (2024-03-22T09:15:36Z)
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [27.930351465266515]
We propose a simple yet effective training strategy MoE-Tuning for LVLMs.
MoE-LLaVA, a MoE-based sparse LVLM architecture, uniquely activates only the top-k experts through routers.
Experiments show the strong performance of MoE-LLaVA across a variety of visual understanding and object hallucination benchmarks.
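To make the "activate only the top-k experts through routers" mechanism concrete, here is a minimal, generic top-k MoE routing sketch; the dimensions, the value of k, and the expert definitions are illustrative assumptions rather than MoE-LLaVA's actual architecture. A linear router scores the experts per token, and only the k highest-scoring experts process that token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic sparse MoE layer: route each token to its top-k experts only."""
    def __init__(self, dim: int = 512, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, dim)
        gate = F.softmax(self.router(x), dim=-1)            # (tokens, num_experts)
        weights, idx = gate.topk(self.k, dim=-1)            # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(8, 512)
print(TopKMoE()(tokens).shape)  # torch.Size([8, 512])
```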
arXiv Detail & Related papers (2024-01-29T08:13:40Z)
- Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z)
- Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research.
In this paper, we focus on exploring PLMs as stand-alone models for vision-language reasoning tasks.
We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP).
arXiv Detail & Related papers (2023-06-01T07:19:28Z)
- Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models [77.2078051555533]
We propose a novel and affordable solution for the effective VL adaptation of large language models (LLMs).
Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters.
MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions.
arXiv Detail & Related papers (2023-05-24T11:06:15Z)
- Towards a Unified View on Visual Parameter-Efficient Transfer Learning [96.99924127527002]
We propose a framework with a unified view called visual-PETL (V-PETL) to investigate the different aspects affecting the trade-off.
An effective scheme Swin-BAPAT derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin.
arXiv Detail & Related papers (2022-10-03T09:54:39Z)