M$^2$IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension
- URL: http://arxiv.org/abs/2407.01131v2
- Date: Tue, 29 Oct 2024 12:57:42 GMT
- Title: M$^2$IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension
- Authors: Xuyang Liu, Ting Liu, Siteng Huang, Yi Xin, Yue Hu, Quanjun Yin, Donglin Wang, Honggang Chen
- Abstract summary: Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression.
Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters.
We present M$^2$IST: Multi-Modal Interactive Side-Tuning with M$^3$ISAs: Mixture of Multi-Modal Interactive Side-Adapters.
- Score: 36.01063804442098
- License:
- Abstract: Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained vision-language foundation models for REC yields impressive performance but becomes increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters. However, directly applying PETL to REC faces two challenges: (1) insufficient multi-modal interaction between pre-trained vision-language foundation models, and (2) high GPU memory usage due to gradients passing through the heavy vision-language foundation models. To this end, we present M$^2$IST: Multi-Modal Interactive Side-Tuning with M$^3$ISAs: Mixture of Multi-Modal Interactive Side-Adapters. During fine-tuning, we keep the pre-trained uni-modal encoders fixed, updating M$^3$ISAs on side networks to progressively connect them, enabling more comprehensive vision-language alignment and efficient tuning for REC. Empirical results reveal that M$^2$IST achieves an optimal balance between performance and efficiency compared to most full fine-tuning and other PETL methods. With M$^2$IST, standard transformer-based REC methods present competitive or even superior performance compared to full fine-tuning, while utilizing only 2.11\% of the tunable parameters, 39.61\% of the GPU memory, and 63.46\% of the fine-tuning time required for full fine-tuning.
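The abstract above describes the core recipe: keep both pre-trained uni-modal encoders frozen and train only small side-adapters that progressively exchange vision and language information, so gradients never traverse the heavy backbones. The PyTorch sketch below is a minimal illustration of that idea under assumed module names, feature shapes, and fusion design; it is not the authors' implementation of M$^3$ISAs.

```python
import torch
import torch.nn as nn


class InteractiveSideAdapter(nn.Module):
    """Hypothetical lightweight adapter that lets vision features attend to language."""

    def __init__(self, dim: int, bottleneck: int = 64, heads: int = 4):
        super().__init__()
        self.down_v = nn.Linear(dim, bottleneck)
        self.down_t = nn.Linear(dim, bottleneck)
        self.cross = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, vis_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        v, t = self.down_v(vis_feat), self.down_t(txt_feat)
        fused, _ = self.cross(query=v, key=t, value=t)  # vision queries language
        return self.up(fused)


class SideTunedREC(nn.Module):
    """Frozen uni-modal encoders plus a trainable side network (illustrative only)."""

    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module,
                 dim: int = 256, num_adapters: int = 4):
        super().__init__()
        self.vision_encoder, self.text_encoder = vision_encoder, text_encoder
        for enc in (self.vision_encoder, self.text_encoder):
            for p in enc.parameters():
                p.requires_grad_(False)      # pre-trained backbones stay fixed
        self.adapters = nn.ModuleList(
            InteractiveSideAdapter(dim) for _ in range(num_adapters)
        )
        self.box_head = nn.Linear(dim, 4)    # predicts one box for the referred object

    def forward(self, image: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                # no gradients through the heavy encoders
            vis = self.vision_encoder(image)   # assumed shape (B, N_v, dim)
            txt = self.text_encoder(tokens)    # assumed shape (B, N_t, dim)
        side = torch.zeros_like(vis)
        for adapter in self.adapters:        # progressively refine the side features
            side = side + adapter(vis + side, txt)
        return self.box_head(side.mean(dim=1))  # (B, 4) bounding box
```

Because only `adapters` and `box_head` receive gradients, optimizer state and activation memory scale with the small side network rather than with the frozen encoders, which is the memory argument made in the abstract.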
Related papers
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains.
In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z) - MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension [14.98036475954174]
Referring Expression Comprehension (REC) aims to ground a local visual region via natural language.
Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning.
We propose MaPPER, a novel framework for Multimodal Prior-guided Parameter Efficient Tuning.
MaPPER achieves the best accuracy compared to the full fine-tuning and other PETL methods with only 1.41% backbone parameters.
arXiv Detail & Related papers (2024-09-20T16:12:26Z) - CROME: Cross-Modal Adapters for Efficient Multimodal LLM [28.337072921099494]
Multimodal Large Language Models (MLLMs) demonstrate remarkable image-language capabilities.
Existing approaches often necessitate expensive language model retraining and offer limited adaptability.
We propose CROME, an efficient vision-language instruction tuning framework.
arXiv Detail & Related papers (2024-08-13T03:45:11Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval [10.84733740863356]
In this work, we investigate the parameter-efficient transfer learning (PETL) method to transfer visual-language knowledge from the natural domain to the remote sensing (RS) domain on the image-text retrieval task.
Our proposed model only contains 0.16M training parameters, which can achieve a parameter reduction of 98.9% compared to full fine-tuning.
Our retrieval performance exceeds traditional methods by 7-13% and achieves comparable or better performance than full fine-tuning.
arXiv Detail & Related papers (2023-08-24T02:43:53Z) - Towards Efficient Visual Adaption via Structural Re-parameterization [76.57083043547296]
We propose a parameter-efficient and computation-friendly adapter for giant vision models, called RepAdapter.
RepAdapter outperforms full tuning by +7.2% on average and saves up to 25% training time, 20% GPU memory, and 94.6% storage cost of ViT-B/16 on VTAB-1k.
arXiv Detail & Related papers (2023-02-16T06:14:15Z) - Resource-Efficient Transfer Learning From Speech Foundation Model Using Hierarchical Feature Fusion [44.056153052137674]
We propose a novel hierarchical feature fusion method for resource-efficient transfer learning from speech foundation models.
Experimental results show that the proposed method can achieve better performance on the speech recognition task than existing algorithms.
arXiv Detail & Related papers (2022-11-04T19:03:45Z) - Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task (a minimal sketch of this freeze-and-train-a-few-parameters pattern appears after this list).
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
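Several of the PETL entries above share one basic pattern: freeze a pre-trained backbone and insert a small trainable bottleneck adapter into each layer. The sketch below illustrates that generic pattern under assumed attribute names (`backbone.layers` is hypothetical); it does not reproduce any specific paper's adapter design.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Residual down-project / non-linearity / up-project adapter (generic PETL pattern)."""

    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # zero-init so the adapter starts as an identity map
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))


def attach_adapters(backbone: nn.Module, dim: int) -> nn.ModuleList:
    """Freeze the backbone and return one trainable adapter per layer.

    `backbone.layers` is an assumed attribute; real models expose their blocks differently.
    """
    for p in backbone.parameters():
        p.requires_grad_(False)
    return nn.ModuleList(BottleneckAdapter(dim) for _ in backbone.layers)
```

Comparing `sum(p.numel() for p in adapters.parameters())` against the frozen backbone's parameter count is how the small tunable-parameter fractions reported above (on the order of 0.5%-2%) are obtained.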