Parameter-Efficient Transfer Learning for Remote Sensing Image-Text
Retrieval
- URL: http://arxiv.org/abs/2308.12509v1
- Date: Thu, 24 Aug 2023 02:43:53 GMT
- Title: Parameter-Efficient Transfer Learning for Remote Sensing Image-Text
Retrieval
- Authors: Yuan Yuan, Yang Zhan, and Zhitong Xiong
- Abstract summary: In this work, we investigate the parameter-efficient transfer learning (PETL) method to transfer visual-language knowledge from the natural domain to the RS domain on the image-text retrieval task.
Our proposed model contains only 0.16M training parameters, a 98.9% parameter reduction compared to full fine-tuning.
Our retrieval performance exceeds that of traditional methods by 7-13% and is comparable to or better than that of full fine-tuning.
- Score: 10.84733740863356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-language pre-training (VLP) models have experienced a surge in
popularity recently. By fine-tuning them on specific datasets, significant
performance improvements have been observed in various tasks. However, full
fine-tuning of VLP models not only consumes substantial computational resources
but also has a significant environmental impact.
Moreover, as remote sensing (RS) data is constantly being updated, full
fine-tuning may not be practical for real-world applications. To address this
issue, in this work, we investigate the parameter-efficient transfer learning
(PETL) method to effectively and efficiently transfer visual-language knowledge
from the natural domain to the RS domain on the image-text retrieval task. To
this end, we make the following contributions. 1) We construct a novel and
sophisticated PETL framework for the RS image-text retrieval (RSITR) task,
which includes the pretrained CLIP model, a multimodal remote sensing adapter,
and a hybrid multi-modal contrastive (HMMC) learning objective; 2) To deal with
the problem of high intra-modal similarity in RS data, we design a simple yet
effective HMMC loss; 3) We provide comprehensive empirical studies for
PETL-based RS image-text retrieval. Our results demonstrate that the proposed
method is promising and has great potential for practical applications. 4) We
extensively benchmark state-of-the-art PETL methods on the RSITR task. Our
proposed model contains only 0.16M training parameters, a 98.9% parameter
reduction compared to full fine-tuning, resulting in substantial savings in
training costs. Our retrieval performance exceeds that of traditional methods
by 7-13% and is comparable to or better than that of full fine-tuning. This
work can provide new ideas and useful insights for RS
vision-language tasks.
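The framework described above pairs a frozen CLIP backbone with a small trainable adapter and a hybrid multi-modal contrastive objective. Below is a minimal PyTorch sketch of that setup; the module names, bottleneck width, weighting factor `lam`, and the particular intra-modal formulation (tying each adapted feature to its own frozen CLIP feature) are illustrative assumptions, not the authors' released implementation.

```python
# Sketch: residual bottleneck adapter on top of frozen CLIP features, plus a
# hybrid contrastive loss combining cross-modal and intra-modal terms.
# Hyperparameters and the intra-modal term are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter: the only trainable parameters."""

    def __init__(self, dim: int = 512, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen CLIP representation accessible.
        return x + self.up(F.relu(self.down(x)))


def info_nce(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Symmetric InfoNCE: matched rows of `a` and `b` are positive pairs."""
    logits = a @ b.t() / t
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def hybrid_contrastive_loss(img_adapt, txt_adapt, img_frozen, txt_frozen,
                            temperature: float = 0.07, lam: float = 0.5) -> torch.Tensor:
    """Cross-modal term plus intra-modal terms tying each adapted feature to its
    own frozen CLIP feature (one plausible way to counter the high intra-modal
    similarity of remote sensing data)."""
    img_adapt, txt_adapt = F.normalize(img_adapt, dim=-1), F.normalize(txt_adapt, dim=-1)
    img_frozen, txt_frozen = F.normalize(img_frozen, dim=-1), F.normalize(txt_frozen, dim=-1)

    cross = info_nce(img_adapt, txt_adapt, temperature)
    intra = info_nce(img_adapt, img_frozen, temperature) + info_nce(txt_adapt, txt_frozen, temperature)
    return cross + lam * intra


if __name__ == "__main__":
    # Stand-ins for frozen CLIP image/text embeddings (batch of 8, feature dim 512).
    img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 512)
    adapter = BottleneckAdapter()
    loss = hybrid_contrastive_loss(adapter(img_feat), adapter(txt_feat), img_feat, txt_feat)
    print(f"hybrid loss: {loss.item():.4f}")
    print(f"trainable parameters: {sum(p.numel() for p in adapter.parameters())}")
```

Counting the adapter's parameters, as the sketch does, shows how a bottleneck module stays in the sub-million range even though the frozen CLIP backbone holds on the order of 10^8 parameters, which is the kind of gap behind the reported 98.9% reduction.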
Related papers
- MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension [14.98036475954174]
Referring Expression Comprehension (REC) aims to ground a local visual region via natural language.
Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning.
We propose a novel framework of Multimodal Prior-guided Parameter Efficient Tuning, namely MaPPER.
MaPPER achieves higher accuracy than full fine-tuning and other PETL methods while tuning only 1.41% of the backbone parameters.
arXiv Detail & Related papers (2024-09-20T16:12:26Z) - When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective [57.05315507519704]
We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing.
Our measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies up to 91%.
arXiv Detail & Related papers (2024-09-03T12:03:45Z) - Efficient and Versatile Robust Fine-Tuning of Zero-shot Models [34.27380518351181]
We introduce Robust Adapter (R-Adapter), a novel method for fine-tuning zero-shot models to downstream tasks.
Our method integrates lightweight modules into the pre-trained model and employs novel self-ensemble techniques to boost OOD robustness and reduce storage expenses substantially.
Our experiments demonstrate that R-Adapter achieves state-of-the-art performance across a diverse set of tasks, tuning only 13% of the parameters of the CLIP encoders.
arXiv Detail & Related papers (2024-08-11T11:37:43Z) - Refining Joint Text and Source Code Embeddings for Retrieval Task with Parameter-Efficient Fine-Tuning [0.0]
We propose a fine-tuning framework that leverages Parameter-Efficient Fine-Tuning (PEFT) techniques.
We demonstrate that the proposed fine-tuning framework has the potential to improve code-text retrieval performance by tuning at most 0.4% of the parameters.
arXiv Detail & Related papers (2024-05-07T08:50:25Z) - Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning [13.964106147449051]
Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets.
We propose a novel and effective framework based on learning Visual Prompts (VPT) in pre-trained Vision Transformers (ViT).
We demonstrate that our new approximations with semantic information are superior in representative capability.
arXiv Detail & Related papers (2024-02-04T04:42:05Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language
Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Parameter and Computation Efficient Transfer Learning for
Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z) - Towards Efficient Visual Adaption via Structural Re-parameterization [76.57083043547296]
We propose a parameter-efficient and computationally friendly adapter for giant vision models, called RepAdapter.
RepAdapter outperforms full tuning by +7.2% on average and saves up to 25% training time, 20% GPU memory, and 94.6% storage cost of ViT-B/16 on VTAB-1k.
arXiv Detail & Related papers (2023-02-16T06:14:15Z) - Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than
In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm where a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z) - A Practical Contrastive Learning Framework for Single-Image
Super-Resolution [51.422185656787285]
We investigate contrastive learning-based single image super-resolution from two perspectives.
We propose a practical contrastive learning framework for SISR, named PCL-SR.
We re-train existing benchmark methods with our proposed PCL-SR framework and achieve superior performance.
arXiv Detail & Related papers (2021-11-27T15:42:12Z)