Module-wise Adaptive Distillation for Multimodality Foundation Models
- URL: http://arxiv.org/abs/2310.04550v1
- Date: Fri, 6 Oct 2023 19:24:00 GMT
- Title: Module-wise Adaptive Distillation for Multimodality Foundation Models
- Authors: Chen Liang, Jiahui Yu, Ming-Hsuan Yang, Matthew Brown, Yin Cui, Tuo
Zhao, Boqing Gong, Tianyi Zhou
- Abstract summary: Multimodal foundation models have demonstrated remarkable generalizability but pose challenges for deployment due to their large sizes.
One effective approach to reducing their sizes is layerwise distillation, wherein small student models are trained to match the hidden representations of large teacher models at each layer.
Motivated by our observation that certain architecture components, referred to as modules, contribute more significantly to the student's performance than others, we propose to track the contributions of individual modules by recording the loss decrement after distilling each module, and to distill the modules with greater contributions more frequently.
- Score: 125.42414892566843
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained multimodal foundation models have demonstrated remarkable
generalizability but pose challenges for deployment due to their large sizes.
One effective approach to reducing their sizes is layerwise distillation,
wherein small student models are trained to match the hidden representations of
large teacher models at each layer. Motivated by our observation that certain
architecture components, referred to as modules, contribute more significantly
to the student's performance than others, we propose to track the contributions
of individual modules by recording the loss decrement after distilling each
module and choose the module with a greater contribution to distill more
frequently. Such an approach can be naturally formulated as a multi-armed
bandit (MAB) problem, where modules and loss decrements are considered as arms
and rewards, respectively. We then develop a modified-Thompson sampling
algorithm named OPTIMA to address the nonstationarity of module contributions
resulting from model updating. Specifically, we leverage the observed
contributions in recent history to estimate the changing contribution of each
module and select modules based on these estimations to maximize the cumulative
contribution. We evaluate the effectiveness of OPTIMA through distillation
experiments on various multimodal understanding and image captioning tasks,
using the CoCa-Large model (Yu et al., 2022) as the teacher model.
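As a rough illustration of the bandit view described in the abstract, below is a minimal sketch in which each module is an arm, the loss decrement observed after distilling that module is the reward, and a sliding window over recent rewards accounts for nonstationarity. The class names, window size, Gaussian posterior, and `distill_module` callback are illustrative assumptions, not the authors' exact OPTIMA algorithm.

```python
import random
from collections import deque

class ModuleArm:
    """One arm per student module; stores only recent rewards so the
    estimate tracks the module's changing contribution over training."""
    def __init__(self, name, window=50):
        self.name = name
        self.rewards = deque(maxlen=window)  # recent loss decrements

    def sample(self):
        # Thompson-style draw from a Gaussian fitted to recent rewards;
        # an arm with no history gets an optimistic draw so it is explored.
        if not self.rewards:
            return float("inf")
        n = len(self.rewards)
        mean = sum(self.rewards) / n
        var = sum((r - mean) ** 2 for r in self.rewards) / max(n - 1, 1)
        return random.gauss(mean, (var / n + 1e-8) ** 0.5)

def distill_step(arms, distill_module):
    """Pick the module whose sampled contribution is largest, distill it,
    and record the observed loss decrement as this round's reward."""
    arm = max(arms, key=lambda a: a.sample())
    loss_before, loss_after = distill_module(arm.name)  # hypothetical hook
    arm.rewards.append(loss_before - loss_after)
    return arm.name
```

In a training loop, `distill_step` would be called once per distillation step, with `distill_module` running one update that matches the student's hidden representations to the teacher's for the chosen module (e.g., an attention or feed-forward block) and returning the loss before and after the update.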
Related papers
- Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging [111.8456671452411]
Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer.
We propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging.
We show that WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
arXiv Detail & Related papers (2024-10-29T07:16:31Z) - Closed-form merging of parameter-efficient modules for Federated Continual Learning [9.940242741914748]
We introduce LoRM, an alternating optimization strategy that trains one LoRA matrix at a time.
This allows solving for each unknown variable individually, thus finding a unique solution; a minimal sketch of this alternating scheme appears after this list.
Our method demonstrates state-of-the-art performance across a range of FCIL scenarios.
arXiv Detail & Related papers (2024-10-23T15:30:13Z) - SurgeryV2: Bridging the Gap Between Model Merging and Multi-Task Learning with Deep Representation Surgery [54.866490321241905]
Model merging-based multitask learning (MTL) offers a promising approach for performing MTL by merging multiple expert models.
In this paper, we examine the merged model's representation distribution and uncover a critical issue of "representation bias".
This bias arises from a significant distribution gap between the representations of the merged and expert models, leading to the suboptimal performance of the merged MTL model.
arXiv Detail & Related papers (2024-10-18T11:49:40Z) - Is Modularity Transferable? A Case Study through the Lens of Knowledge Distillation [59.37775534633868]
We present an extremely straightforward approach to transferring pre-trained, task-specific PEFT modules between same-family PLMs.
We also propose a method that allows the transfer of modules between incompatible PLMs without any change in the inference complexity.
arXiv Detail & Related papers (2024-03-27T17:50:00Z) - Representation Surgery for Multi-Task Model Merging [57.63643005215592]
Multi-task learning (MTL) compresses the information from multiple tasks into a unified backbone to improve computational efficiency and generalization.
Recent work directly merges multiple independently trained models to perform MTL instead of collecting their raw data for joint training.
By visualizing the representation distribution of existing model merging schemes, we find that the merged model often suffers from the dilemma of representation bias.
arXiv Detail & Related papers (2024-02-05T03:39:39Z) - R-Cut: Enhancing Explainability in Vision Transformers with Relationship Weighted Out and Cut [14.382326829600283]
We introduce two modules: the "Relationship Weighted Out" module and the "Cut" module.
The "Cut" module performs fine-grained feature decomposition, taking into account factors such as position, texture, and color.
We validate our method with extensive qualitative and quantitative experiments on the ImageNet dataset.
arXiv Detail & Related papers (2023-07-18T08:03:51Z) - Modular Deep Learning [120.36599591042908]
Transfer learning has recently become the dominant paradigm of machine learning.
It remains unclear how to develop models that specialise towards multiple tasks without incurring negative interference.
Modular deep learning has emerged as a promising solution to these challenges.
arXiv Detail & Related papers (2023-02-22T18:11:25Z) - Neural Network Module Decomposition and Recomposition [35.21448933547118]
We propose a modularization method that decomposes a deep neural network (DNN) into small modules from a functionality perspective.
We demonstrate that the proposed method can decompose and recompose DNNs with high compression ratio and high accuracy.
arXiv Detail & Related papers (2021-12-25T08:36:47Z)
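Returning to the LoRM entry above: the following is a minimal sketch of what alternating optimization of one low-rank matrix at a time can look like, assuming a least-squares objective over a target weight delta and NumPy pseudo-inverses. The shapes, rank, and iteration count are illustrative assumptions in the spirit of the abstract, not the paper's exact formulation.

```python
import numpy as np

def fit_lora_alternating(delta_w, rank=4, iters=10, seed=0):
    """Fit delta_w ~= B @ A by solving for one factor at a time in
    closed form while the other is held fixed (alternating least squares)."""
    rng = np.random.default_rng(seed)
    d_out, d_in = delta_w.shape
    A = rng.standard_normal((rank, d_in))
    B = delta_w @ np.linalg.pinv(A)
    for _ in range(iters):
        # Fix A, solve min_B ||delta_w - B @ A||_F  =>  B = delta_w @ A^+
        B = delta_w @ np.linalg.pinv(A)
        # Fix B, solve min_A ||delta_w - B @ A||_F  =>  A = B^+ @ delta_w
        A = np.linalg.pinv(B) @ delta_w
    return B, A

delta_w = np.random.default_rng(1).standard_normal((16, 32))
B, A = fit_lora_alternating(delta_w, rank=4)
print(np.linalg.norm(delta_w - B @ A))  # reconstruction error shrinks
```

Solving each factor in closed form with the other fixed is what makes every sub-problem a plain least-squares system with a unique minimum-norm solution, which is the property the LoRM summary highlights.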