M$^2$IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension
- URL: http://arxiv.org/abs/2407.01131v3
- Date: Sun, 16 Feb 2025 18:44:39 GMT
- Title: M$^2$IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension
- Authors: Xuyang Liu, Ting Liu, Siteng Huang, Yi Xin, Yue Hu, Quanjun Yin, Donglin Wang, Yuanyuan Wu, Honggang Chen
- Abstract summary: Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression.
We present M$^2$IST: Multi-Modal Interactive Side-Tuning with M$^3$ISAs: Mixture of Multi-Modal Interactive Side-Adapters.
- Score: 36.39848221201381
- Abstract: Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained vision-language foundation models for REC yields impressive performance but becomes increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters. However, directly applying PETL to REC faces two challenges: (1) insufficient multi-modal interaction between pre-trained vision-language foundation models, and (2) high GPU memory usage due to gradients passing through the heavy vision-language foundation models. To this end, we present M$^2$IST: Multi-Modal Interactive Side-Tuning with M$^3$ISAs: Mixture of Multi-Modal Interactive Side-Adapters. During fine-tuning, we fix the pre-trained uni-modal encoders and update M$^3$ISAs to enable efficient vision-language alignment for REC. Empirical results reveal that M$^2$IST achieves better performance-efficiency trade-off than full fine-tuning and other PETL methods, requiring only 2.11% tunable parameters, 39.61% GPU memory, and 63.46% training time while maintaining competitive performance. Our code is released at https://github.com/xuyang-liu16/M2IST.
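The key idea of side-tuning here is that the heavy pre-trained encoders stay frozen while small side-adapters attached alongside them learn the vision-language interaction, so gradients never have to flow back through the backbones. The PyTorch sketch below illustrates one possible multi-modal interactive side-adapter; the module names, dimensions, and cross-attention fusion are assumptions for illustration, not the authors' M$^3$ISA implementation.
```python
import torch
import torch.nn as nn

class MultiModalSideAdapter(nn.Module):
    """Toy side-adapter: down-project frozen vision/text features,
    let them interact via cross-attention, and up-project back."""
    def __init__(self, vis_dim=768, txt_dim=768, bottleneck=64, heads=4):
        super().__init__()
        self.v_down = nn.Linear(vis_dim, bottleneck)
        self.t_down = nn.Linear(txt_dim, bottleneck)
        # vision tokens attend to language tokens, and vice versa
        self.v2t = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.v_up = nn.Linear(bottleneck, vis_dim)
        self.t_up = nn.Linear(bottleneck, txt_dim)

    def forward(self, vis_feats, txt_feats):
        v, t = self.v_down(vis_feats), self.t_down(txt_feats)
        v_fused, _ = self.v2t(v, t, t)   # language-aware vision features
        t_fused, _ = self.t2v(t, v, v)   # vision-aware language features
        return self.v_up(v + v_fused), self.t_up(t + t_fused)

# Dummy frozen backbones: features are computed under no_grad, so no
# activations are stored for the heavy encoders during backprop.
vision_encoder = nn.TransformerEncoderLayer(768, 8, batch_first=True).eval()
text_encoder = nn.TransformerEncoderLayer(768, 8, batch_first=True).eval()
for p in list(vision_encoder.parameters()) + list(text_encoder.parameters()):
    p.requires_grad = False

adapter = MultiModalSideAdapter()        # only these weights are updated
img_tokens = torch.randn(2, 196, 768)    # dummy patch tokens
txt_tokens = torch.randn(2, 20, 768)     # dummy word tokens
with torch.no_grad():
    v_feats = vision_encoder(img_tokens)
    t_feats = text_encoder(txt_tokens)
v_out, t_out = adapter(v_feats, t_feats)  # gradients flow only through the adapter
```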
Related papers
- Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction [62.8375542401319]
Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone.
The number of vision tokens grows quadratically with the image resolution, leading to huge computational costs.
We propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer, going from shallow to deep.
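A rough sketch of what such a layer-wise greedy search could look like, assuming access to an `evaluate(keep_schedule)` routine that returns validation accuracy for a given per-layer token budget; the function name, candidate budgets, and tolerance criterion are placeholders, not the paper's exact procedure.
```python
def greedy_token_search(evaluate, num_layers, full_tokens, tol=0.005):
    """Greedily shrink the number of vision tokens kept at each layer,
    from shallow to deep, as long as accuracy stays within `tol` of the
    unpruned baseline. `evaluate` is supplied by the caller."""
    keep = [full_tokens] * num_layers
    baseline = evaluate(keep)
    for layer in range(num_layers):            # shallow -> deep
        candidates = sorted({full_tokens // d for d in (2, 4, 8, 16)},
                            reverse=True)
        for budget in candidates:              # try progressively fewer tokens
            trial = keep.copy()
            trial[layer] = budget
            if evaluate(trial) >= baseline - tol:
                keep = trial                   # accept the smaller budget
            else:
                break                          # stop shrinking this layer
    return keep
```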
arXiv Detail & Related papers (2024-11-30T18:54:32Z)
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
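The abstract only says EMMA is a lightweight module that fuses visual and textual encodings before the language backbone. One common lightweight design is to modulate projected vision tokens with a pooled text vector, sketched below purely as an illustration; the actual EMMA architecture may differ.
```python
import torch
import torch.nn as nn

class LightweightFusion(nn.Module):
    """Illustrative text-conditioned modulation of vision tokens
    (a stand-in design, not the actual EMMA module)."""
    def __init__(self, vis_dim=1024, txt_dim=768, out_dim=768):
        super().__init__()
        self.proj = nn.Linear(vis_dim, out_dim)   # align vision to LM width
        self.gate = nn.Linear(txt_dim, out_dim)   # text-derived channel gate

    def forward(self, vis_tokens, txt_tokens):
        pooled = txt_tokens.mean(dim=1, keepdim=True)  # (B, 1, txt_dim)
        gate = torch.sigmoid(self.gate(pooled))        # (B, 1, out_dim)
        return self.proj(vis_tokens) * gate            # text-aware vision tokens

fusion = LightweightFusion()
vis = torch.randn(2, 256, 1024)
txt = torch.randn(2, 32, 768)
fused = fusion(vis, txt)   # (2, 256, 768), fed into the language backbone
```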
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
- MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension [14.98036475954174]
Referring Expression Comprehension (REC) aims to ground a local visual region via natural language.
Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning.
We propose a novel framework of Multimodal Prior-guided Parameter Efficient Tuning, namely MaPPER.
MaPPER achieves the best accuracy compared to full fine-tuning and other PETL methods, tuning only 1.41% of the backbone parameters.
arXiv Detail & Related papers (2024-09-20T16:12:26Z)
- Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across a broad set of benchmarks, we show that we can not only reduce hallucinations but also improve performance on standard benchmarks, outperforming prior approaches.
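As a point of reference for preference tuning on (preferred, dispreferred) pairs, the sketch below shows a generic DPO-style objective computed from sequence log-probabilities; whether POVID uses exactly this loss or a variant is an assumption here.
```python
import torch
import torch.nn.functional as F

def preference_loss(logp_pref, logp_dispref,
                    ref_logp_pref, ref_logp_dispref, beta=0.1):
    """Generic DPO-style loss over sequence log-probs of the tuned policy
    and a frozen reference model (illustrative, not POVID's exact recipe)."""
    margin = (logp_pref - ref_logp_pref) - (logp_dispref - ref_logp_dispref)
    return -F.logsigmoid(beta * margin).mean()

# dummy per-example sequence log-probabilities
loss = preference_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-11.0, -10.0]),
                       torch.tensor([-12.5, -9.8]), torch.tensor([-11.2, -9.9]))
```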
arXiv Detail & Related papers (2024-02-18T00:56:16Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with 30% of vision tokens removed across 12 ViT layers, ELIP maintains comparable performance.
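A minimal illustration of dropping roughly 30% of vision tokens by keeping the highest-scoring ones per image; how ELIP actually derives the scores from language supervision (and how it merges tokens) is not shown, and the scoring here is a placeholder.
```python
import torch

def prune_vision_tokens(tokens, scores, keep_ratio=0.7):
    """Keep the top `keep_ratio` fraction of vision tokens per image,
    ranked by an importance score (placeholder for language-guided scoring)."""
    b, n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    idx = scores.topk(k, dim=1).indices                        # (B, k)
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

tokens = torch.randn(2, 196, 768)
scores = torch.randn(2, 196)                 # e.g. language-supervised importance
kept = prune_vision_tokens(tokens, scores)   # (2, 137, 768) with keep_ratio=0.7
```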
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
- Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval [10.84733740863356]
In this work, we investigate the parameter-efficient transfer learning (PETL) method to transfer visual-language knowledge from the natural domain to the RS domain on the image-text retrieval task.
Our proposed model contains only 0.16M trainable parameters, a 98.9% parameter reduction compared to full fine-tuning.
Our retrieval performance exceeds that of traditional methods by 7-13% and is comparable to or better than full fine-tuning.
arXiv Detail & Related papers (2023-08-24T02:43:53Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We argue for directing effort toward efficient adaptation of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
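The description is concrete enough to sketch: the language model and visual encoder stay frozen, and only a linear projection of visual features plus a single trainable soft token are learned. Below is an illustrative skeleton with dummy tensors standing in for the real backbones; shapes and names are assumptions, not the eP-ALM code.
```python
import torch
import torch.nn as nn

class PerceptualAugmentation(nn.Module):
    """Illustrative eP-ALM-style wrapper: frozen LM + frozen vision encoder,
    with only a linear projection and one soft token trainable."""
    def __init__(self, vis_dim=768, lm_dim=1024):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lm_dim)                     # trainable
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))  # trainable

    def forward(self, vis_feats, text_embeds):
        # Project visual features into the LM embedding space and prepend
        # them, together with the soft token, to the text embeddings.
        b = text_embeds.size(0)
        vis = self.proj(vis_feats)
        tok = self.soft_token.expand(b, -1, -1)
        return torch.cat([tok, vis, text_embeds], dim=1)  # fed to the frozen LM

aug = PerceptualAugmentation()
vis_feats = torch.randn(2, 1, 768)       # e.g. [CLS] feature from a frozen ViT
text_embeds = torch.randn(2, 16, 1024)   # embeddings from a frozen LM
lm_inputs = aug(vis_feats, text_embeds)  # (2, 18, 1024)
```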
arXiv Detail & Related papers (2023-03-20T19:20:34Z)