MultiWay-Adapter: Adapting large-scale multi-modal models for scalable image-text retrieval
- URL: http://arxiv.org/abs/2309.01516v3
- Date: Mon, 5 Feb 2024 22:43:45 GMT
- Title: MultiWay-Adapter: Adapting large-scale multi-modal models for scalable image-text retrieval
- Authors: Zijun Long, George Killick, Richard McCreadie, Gerardo Aragon Camarasa
- Abstract summary: MultiWay-Adapter (MWA) is a novel framework featuring an 'Alignment Enhancer'.
This enhancer deepens inter-modal alignment, enabling high transferability with minimal tuning effort.
Experiments show that, unlike prior efficient tuning approaches, MWA maintains model effectiveness while reducing training time by up to 57%.
- Score: 4.4173427917548524
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As Multimodal Large Language Models (MLLMs) grow in size, adapting them to
specialized tasks becomes increasingly challenging due to high computational
and memory demands. Indeed, traditional fine-tuning methods are costly, due to
the need for extensive, task-specific training. While efficient adaptation
methods exist that aim to reduce these costs, in practice they suffer from
shallow inter-modal alignment, which severely hurts model effectiveness. To
tackle these computational challenges and improve inter-modal alignment, we
introduce the MultiWay-Adapter (MWA), a novel framework featuring an 'Alignment
Enhancer'. This enhancer deepens inter-modal alignment, enabling high
transferability with minimal tuning effort. Our experiments show that unlike
prior efficient tuning approaches, MWA maintains model effectiveness, while
reducing training time by up to 57%. MWA is also lightweight, increasing model
size by only 2-3% (in terms of parameters) for state-of-the-art foundation
models like BEiT-3 Large. These results demonstrate that MWA provides an
efficient and effective adaptation method for MLLMs, significantly broadening
their applicability.
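The abstract does not spell out the Alignment Enhancer's architecture, but the key numbers (a frozen backbone such as BEiT-3 Large, with only 2-3% added parameters being trained) match the general adapter pattern. Below is a minimal, hypothetical sketch of that pattern in PyTorch; the module names, bottleneck size, and the assumption that the backbone exposes a `blocks` list are illustrative, not the paper's actual implementation.

```python
# Hypothetical sketch: small bottleneck adapters added to a frozen multimodal backbone,
# so that only a few percent of parameters are trainable. Not the actual MWA code.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, apply a non-linearity, up-project, then add a residual."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

def add_adapters(backbone: nn.Module, hidden_dim: int = 1024) -> nn.Module:
    # Freeze every pre-trained weight; only the adapters will receive gradients.
    for p in backbone.parameters():
        p.requires_grad = False
    # Attach one adapter per transformer block (assumes `backbone.blocks` exists);
    # wiring the adapter into each block's forward pass is omitted in this sketch.
    for block in backbone.blocks:
        block.adapter = BottleneckAdapter(hidden_dim)
    trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
    total = sum(p.numel() for p in backbone.parameters())
    print(f"trainable fraction: {trainable / total:.2%}")  # typically a few percent
    return backbone
```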
Related papers
- AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment [13.977849745488339]
AmoebaLLM is a novel framework designed to enable the instant derivation of large language models of arbitrary shapes.
AmoebaLLM significantly facilitates rapid deployment tailored to various platforms and applications.
arXiv Detail & Related papers (2024-11-15T22:02:28Z)
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- CROME: Cross-Modal Adapters for Efficient Multimodal LLM [28.337072921099494]
Multimodal Large Language Models (MLLMs) demonstrate remarkable image-language capabilities.
Existing approaches often necessitate expensive language model retraining and offer limited adaptability.
We propose CROME, an efficient vision-language instruction tuning framework.
arXiv Detail & Related papers (2024-08-13T03:45:11Z)
- MoExtend: Tuning New Experts for Modality and Task Extension [61.29100693866109]
MoExtend is an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models.
MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune pretrained models.
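As a rough illustration of what "integrating new experts without tuning the pre-trained model" can look like (a generic sketch, not MoExtend's actual procedure), one can freeze an existing MoE layer, append a fresh expert, and widen the router by one output; all names below are assumptions.

```python
# Generic sketch (not MoExtend's recipe): grow a Mixture-of-Experts layer by one
# trainable expert while keeping the pre-trained experts and router weights frozen.
import torch
import torch.nn as nn

def extend_moe_layer(experts: nn.ModuleList, router: nn.Linear, make_expert):
    # Freeze the pre-trained experts and router.
    for p in experts.parameters():
        p.requires_grad = False
    for p in router.parameters():
        p.requires_grad = False

    # Append a freshly initialised (trainable) expert.
    experts.append(make_expert())

    # Build a wider router: copy the old logit rows, add one new trainable row.
    old_e, d = router.out_features, router.in_features
    new_router = nn.Linear(d, old_e + 1, bias=router.bias is not None)
    with torch.no_grad():
        new_router.weight[:old_e].copy_(router.weight)
        if router.bias is not None:
            new_router.bias[:old_e].copy_(router.bias)
    # In practice, gradients on the copied rows would also be masked so that only
    # the new expert's routing row is updated during tuning.
    return experts, new_router
```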
arXiv Detail & Related papers (2024-08-07T02:28:37Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
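The summary does not describe how MMLoRA distributes these modules across modalities, but the underlying building block is standard low-rank adaptation: a frozen weight augmented with a trainable low-rank update. A minimal sketch of that building block, with rank and scaling chosen arbitrarily, follows.

```python
# Minimal low-rank adaptation building block: the frozen weight W gets a trainable
# low-rank update B @ A. How MMLoRA arranges such modules across modalities is an
# open detail here; the rank and scaling below are arbitrary illustrative choices.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # keep the pre-trained weight frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = base(x) + scale * x A^T B^T
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```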
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
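The summary leaves out FineQuant's exact granularity, bit width, and kernels; as a hedged illustration of the general idea (weights stored in low precision with one scale per small group, dequantised on the fly), a simple group-wise weight-only quantiser might look like this.

```python
# Illustrative group-wise weight-only quantisation; the group size, bit width, and the
# symmetric rounding scheme are assumptions, not FineQuant's actual algorithm.
import torch

def quantize_weight(w: torch.Tensor, group_size: int = 128, bits: int = 8):
    """Quantise a (out, in) weight matrix group-wise along the input dimension.
    Assumes `in` is a multiple of `group_size`; values are stored in an int8 container."""
    out_f, in_f = w.shape
    w_groups = w.reshape(out_f, in_f // group_size, group_size)
    qmax = 2 ** (bits - 1) - 1
    scales = (w_groups.abs().amax(dim=-1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(w_groups / scales), -qmax - 1, qmax).to(torch.int8)
    return q, scales

def dequantize_weight(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    out_f, n_groups, group_size = q.shape
    return (q.float() * scales).reshape(out_f, n_groups * group_size)
```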
arXiv Detail & Related papers (2023-08-16T23:57:41Z)
- Efficient Multimodal Fusion via Interactive Prompting [62.08292938484994]
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.
We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
arXiv Detail & Related papers (2023-04-13T07:31:51Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models by augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
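Taken at face value, that recipe (frozen language model and visual encoder, one trained linear projection, one trainable prepended token) can be sketched as follows; module names, dimensions, and the HF-style `inputs_embeds` call are assumptions rather than eP-ALM's actual code.

```python
# Sketch of the recipe described above: everything frozen except one linear projection
# and one prepended soft token. Names, dimensions, and the LM call are assumptions.
import torch
import torch.nn as nn

class PerceptualPrefix(nn.Module):
    def __init__(self, lm: nn.Module, vision_encoder: nn.Module,
                 vis_dim: int = 768, lm_dim: int = 4096):
        super().__init__()
        self.lm, self.vision_encoder = lm, vision_encoder
        for p in self.lm.parameters():            # >99% of parameters stay frozen
            p.requires_grad = False
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(vis_dim, lm_dim)    # the single trained projection
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))  # one trainable token

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor):
        vis = self.proj(self.vision_encoder(image))            # assumes (B, N, vis_dim) features
        prefix = self.soft_token.expand(vis.size(0), -1, -1)   # (B, 1, lm_dim)
        inputs = torch.cat([prefix, vis, text_embeds], dim=1)
        return self.lm(inputs_embeds=inputs)                   # assumes an HF-style LM API
```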
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.