When Parameter-efficient Tuning Meets General-purpose Vision-language Models
- URL: http://arxiv.org/abs/2312.12458v1
- Date: Sat, 16 Dec 2023 17:13:08 GMT
- Title: When Parameter-efficient Tuning Meets General-purpose Vision-language Models
- Authors: Yihang Zhai, Haixin Wang, Jianlong Chang, Xinlong Yang, Jinan Sun, Shikun Zhang, Qi Tian
- Abstract summary: PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
- Score: 65.19127815275307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction tuning has shown promising potential for developing
general-purpose AI capabilities by using large-scale pre-trained models, and it
has spurred growing research into integrating multimodal information for creative
applications. However, existing works still face two main limitations: the high
training costs and heavy computing resource dependence of full model
fine-tuning, and the lack of semantic information in instructions, which
hinders multimodal alignment. Addressing these challenges, this paper proposes
a novel approach to utilize Parameter-Efficient Tuning for generAl-purpose
vision-Language models, namely PETAL. PETAL revolutionizes the training process
by requiring only 0.5% of the total parameters, achieved through a unique mode
approximation technique, which significantly reduces the training costs and
reliance on heavy computing resources. Furthermore, PETAL enhances the semantic
depth of instructions in two innovative ways: 1) by introducing adaptive
instruction mixture-of-experts (MoEs), and 2) by fortifying the score-based
linkage between parameter-efficient tuning and mutual information. Our
extensive experiments across five multimodal downstream benchmarks reveal that
PETAL not only outperforms current state-of-the-art methods in most scenarios
but also surpasses full fine-tuning models in effectiveness. Additionally, our
approach demonstrates remarkable advantages in few-shot settings, backed by
comprehensive visualization analyses. Our source code is available at:
https://github.com/melonking32/PETAL.
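The 0.5% figure is possible because only small factor matrices are trained while the backbone stays frozen. As a rough illustration of how a mode-approximation-style update can be that cheap, here is a minimal PyTorch sketch of a CP-style low-rank adapter around a frozen linear layer; the class name, rank, and initialization are illustrative assumptions, not PETAL's actual implementation.

```python
# Hypothetical sketch of a mode-approximation-style update: train only small
# CP-style factor matrices around a frozen linear layer. Names, rank, and
# initialization are illustrative; this is not PETAL's actual implementation.
import torch
import torch.nn as nn

class ModeApproxAdapter(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # backbone weights stay frozen
        d_in, d_out = base.in_features, base.out_features
        # Trainable factors: roughly rank * (d_in + d_out + 1) parameters
        # instead of the d_in * d_out in the full weight matrix.
        self.A = nn.Parameter(torch.randn(d_in, rank) * 0.01)
        self.s = nn.Parameter(torch.ones(rank))          # per-mode scales
        self.B = nn.Parameter(torch.zeros(rank, d_out))  # zero init: no-op at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + ((x @ self.A) * self.s) @ self.B

layer = ModeApproxAdapter(nn.Linear(768, 768), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
print(f"{trainable} trainable vs {frozen} frozen (~{100 * trainable / frozen:.1f}%)")
```

At rank 4 on a 768-wide layer this trains about 1% of the layer's parameters; shrinking the rank or sharing factors across layers would push the budget toward the 0.5% the abstract reports.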
Related papers
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains.
In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z)
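For context on the entry above: prompt tuning trains a handful of prompt vectors that are prepended to a frozen backbone's input. The sketch below is a generic single-modality version, assuming an encoder that consumes (batch, seq, dim) embeddings; it is not M$^2$PT's code, which injects prompts into both visual and textual pathways.

```python
# Minimal illustration of prompt tuning: learnable prompt tokens are prepended
# to a frozen encoder's input sequence, and only the prompts are trained.
# Generic sketch, not M$^2$PT's actual implementation.
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int, num_prompts: int = 16):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # backbone frozen
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim)
        batch = token_embeds.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.encoder(torch.cat([prompts, token_embeds], dim=1))
```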
- Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning [12.00246872965739]
We propose a novel dynamic self-adaptive multiscale distillation approach from a pre-trained multimodal large model.
Our strategy employs a multiscale perspective, enabling the extraction of structural knowledge at multiple scales from the pre-trained multimodal large model.
Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information.
arXiv Detail & Related papers (2024-04-16T18:22:49Z)
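A distillation objective of the multiscale kind the entry above describes can be sketched as follows; the pooling scales, the plain MSE matching, and the equal per-scale weights are assumptions for illustration, not the cited paper's design.

```python
# Illustrative multiscale feature distillation: match student features to
# teacher output features pooled at several spatial scales. Scales and
# per-scale weights here are assumptions, not the cited paper's design.
import torch
import torch.nn.functional as F

def multiscale_distill_loss(student_feat, teacher_feat, scales=(1, 2, 4)):
    """student_feat, teacher_feat: (batch, channels, H, W) feature maps."""
    loss = 0.0
    for s in scales:
        s_pooled = F.adaptive_avg_pool2d(student_feat, s)  # (B, C, s, s)
        t_pooled = F.adaptive_avg_pool2d(teacher_feat, s)
        loss = loss + F.mse_loss(s_pooled, t_pooled)
    return loss / len(scales)
```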
- Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning [50.73666458313015]
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications.
MoE has emerged as a promising solution, with its sparse architecture enabling effective task decoupling.
Intuition-MoR1E achieves superior efficiency and a 2.15% overall accuracy improvement across 14 public datasets.
arXiv Detail & Related papers (2024-04-13T12:14:58Z)
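The core of a mixture of rank-1 experts can be shown compactly: each expert i contributes a rank-1 update (x·u_i)v_i to a frozen layer, and a learned router mixes the experts. This is a generic sketch, not Intuition-MoR1E's formulation; in particular, the routing here ignores whatever "intuition" signal the paper uses.

```python
# Illustrative mixture of rank-1 experts: each expert is a rank-1 update
# u_i v_i^T of a frozen weight, mixed by a learned router. Generic sketch,
# not Intuition-MoR1E's actual formulation.
import torch
import torch.nn as nn

class Rank1MoE(nn.Module):
    def __init__(self, base: nn.Linear, num_experts: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # backbone frozen
        d_in, d_out = base.in_features, base.out_features
        self.U = nn.Parameter(torch.randn(num_experts, d_in) * 0.01)
        self.V = nn.Parameter(torch.zeros(num_experts, d_out))
        self.router = nn.Linear(d_in, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.softmax(self.router(x), dim=-1)  # (..., E) expert weights
        expert_acts = x @ self.U.t()                   # (..., E): x . u_i per expert
        mixed = (gates * expert_acts) @ self.V         # weighted sum of rank-1 updates
        return self.base(x) + mixed
```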
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
arXiv Detail & Related papers (2024-03-18T08:00:23Z)
- Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning [13.964106147449051]
Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets.
We propose a novel and effective framework based on learning Visual Prompts (VPT) in pre-trained Vision Transformers (ViT).
We demonstrate that our new approximations, enriched with semantic information, offer superior representative capabilities.
arXiv Detail & Related papers (2024-02-04T04:42:05Z)
- What Makes for Robust Multi-Modal Models in the Face of Missing Modalities? [35.19295402483624]
We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective.
We introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA).
UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and utilizes missing modality data augmentation techniques to better adapt to situations with missing modalities.
arXiv Detail & Related papers (2023-10-10T07:47:57Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers to reduce the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
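To make the sharing scheme above concrete: in an MPO-style factorization, each layer's weight is rebuilt from a large central factor plus small auxiliary factors, and the central factor can be shared by every layer. The sketch below collapses the full tensor-train structure into a single shared middle matrix with per-layer side matrices; names and sizes are illustrative, not the paper's architecture.

```python
# Toy illustration of MPO-style parameter sharing: each layer's weight is
# rebuilt from one central factor shared by all layers plus small per-layer
# auxiliary factors. A simplified two-sided stand-in for a full matrix
# product operator decomposition, not the paper's implementation.
import torch
import torch.nn as nn

class SharedFactorLinear(nn.Module):
    def __init__(self, central: nn.Parameter, d_in: int, d_out: int):
        super().__init__()
        r_in, r_out = central.shape
        self.central = central  # the large factor, shared across layers
        self.left = nn.Parameter(torch.randn(d_in, r_in) * 0.02)     # per-layer
        self.right = nn.Parameter(torch.randn(r_out, d_out) * 0.02)  # per-layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight left @ central @ right is never materialized.
        return ((x @ self.left) @ self.central) @ self.right

# One central tensor serves every layer; only the small side factors differ.
central = nn.Parameter(torch.randn(64, 64) * 0.02)
layers = nn.ModuleList([SharedFactorLinear(central, 768, 768) for _ in range(12)])
```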