When Parameter-efficient Tuning Meets General-purpose Vision-language
Models
- URL: http://arxiv.org/abs/2312.12458v1
- Date: Sat, 16 Dec 2023 17:13:08 GMT
- Authors: Yihang Zhai, Haixin Wang, Jianlong Chang, Xinlong Yang, Jinan Sun,
Shikun Zhang, Qi Tian
- Abstract summary: PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
- Score: 65.19127815275307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction tuning has shown promising potential for developing
general-purpose AI capabilities by using large-scale pre-trained models and
has spurred growing research into integrating multimodal information for creative
applications. However, existing works still face two main limitations: the high
training costs and heavy computing resource dependence of full model
fine-tuning, and the lack of semantic information in instructions, which
hinders multimodal alignment. Addressing these challenges, this paper proposes
a novel approach to utilize Parameter-Efficient Tuning for generAl-purpose
vision-Language models, namely PETAL. PETAL revolutionizes the training process
by requiring only 0.5% of the total parameters, achieved through a unique mode
approximation technique, which significantly reduces the training costs and
reliance on heavy computing resources. Furthermore, PETAL enhances the semantic
depth of instructions in two innovative ways: 1) by introducing adaptive
instruction mixture-of-experts (MoEs), and 2) by fortifying the score-based
linkage between parameter-efficient tuning and mutual information. Our
extensive experiments across five multimodal downstream benchmarks reveal that
PETAL not only outperforms current state-of-the-art methods in most scenarios
but also surpasses full fine-tuning models in effectiveness. Additionally, our
approach demonstrates remarkable advantages in few-shot settings, backed by
comprehensive visualization analyses. Our source code is available at:
https://github.com/melonking32/PETAL
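The abstract's core recipe — freezing the backbone and training only a small low-rank ("mode approximation"-style) update, routed through a mixture of experts — can be illustrated with a minimal PyTorch sketch. This is not PETAL's actual implementation; the class names, rank, and expert count are illustrative assumptions chosen only to show how the trainable-parameter fraction stays small:

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Frozen base linear layer plus a trainable rank-r update
    (a LoRA-style stand-in for PETAL's mode approximation)."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)  # backbone weights stay frozen
        self.base.bias.requires_grad_(False)
        self.down = nn.Linear(dim, rank, bias=False)  # trainable: dim -> r
        self.up = nn.Linear(rank, dim, bias=False)    # trainable: r -> dim

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

class AdapterMoE(nn.Module):
    """Tiny adaptive mixture of adapter experts with a softmax gate."""
    def __init__(self, dim: int, num_experts: int = 2, rank: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            LowRankAdapter(dim, rank) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)             # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, D, E)
        return (outs * weights.unsqueeze(1)).sum(dim=-1)          # (B, D)

dim = 16
moe = AdapterMoE(dim)
trainable = sum(p.numel() for p in moe.parameters() if p.requires_grad)
total = sum(p.numel() for p in moe.parameters())
print(f"trainable fraction: {trainable / total:.2f}")
```

At realistic hidden sizes (thousands of dimensions) the rank-r updates shrink the trainable fraction toward the sub-1% regime the paper reports; the gate makes the expert mixture input-adaptive, echoing the adaptive instruction MoE idea.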
Related papers
- Refining Joint Text and Source Code Embeddings for Retrieval Task with Parameter-Efficient Fine-Tuning [0.0]
We propose a fine-tuning framework that leverages Parameter-Efficient Fine-Tuning (PEFT) techniques.
We demonstrate that the proposed fine-tuning framework has the potential to improve code-text retrieval performance by tuning at most 0.4% of the parameters.
arXiv Detail & Related papers (2024-05-07T08:50:25Z) - Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning [12.00246872965739]
We propose a novel dynamic self-adaptive multiscale distillation from pre-trained multimodal large model.
Our strategy employs a multiscale perspective, enabling the extraction of structural knowledge from the pre-trained multimodal large model.
Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information.
arXiv Detail & Related papers (2024-04-16T18:22:49Z) - Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning [50.73666458313015]
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications.
MoE has emerged as a promising solution, with its sparse architecture enabling effective task decoupling.
Intuition-MoR1E achieves superior efficiency and a 2.15% overall accuracy improvement across 14 public datasets.
arXiv Detail & Related papers (2024-04-13T12:14:58Z) - Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
arXiv Detail & Related papers (2024-03-18T08:00:23Z) - Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning [13.964106147449051]
Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets.
We propose a novel and effective framework based on learning Visual Prompts (VPT) in pre-trained Vision Transformers (ViT).
We demonstrate that our new approximations with semantic information are superior in representational capability.
arXiv Detail & Related papers (2024-02-04T04:42:05Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - What Makes for Robust Multi-Modal Models in the Face of Missing
Modalities? [35.19295402483624]
We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective.
We introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA)
UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and utilizes missing modality data augmentation techniques to better adapt to situations with missing modalities.
arXiv Detail & Related papers (2023-10-10T07:47:57Z) - End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures.
We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient
Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operator (MPO)
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers to reduce the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.