Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging
- URL: http://arxiv.org/abs/2410.21804v1
- Date: Tue, 29 Oct 2024 07:16:31 GMT
- Title: Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging
- Authors: Li Shen, Anke Tang, Enneng Yang, Guibing Guo, Yong Luo, Lefei Zhang, Xiaochun Cao, Bo Du, Dacheng Tao
- Abstract summary: Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer.
We propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging.
We show that WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
- Score: 111.8456671452411
- Abstract: Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer. Recent research on task arithmetic-based MTL demonstrates that merging the parameters of independently fine-tuned models can effectively achieve MTL. However, existing merging methods primarily seek a static optimal solution within the original model parameter space, which often results in performance degradation due to the inherent diversity among tasks and potential interference between them. To address this challenge, we propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging. Specifically, we first identify critical (or sensitive) modules by analyzing parameter variations in core modules of Transformer-based models before and after fine-tuning. WEMoE then statically merges non-critical modules while transforming critical modules into a mixture-of-experts (MoE) structure. During inference, expert modules in the MoE are dynamically merged based on input samples, enabling a more flexible and adaptive merging approach. Building on WEMoE, we further introduce an efficient-and-effective WEMoE (E-WEMoE) method, whose core mechanism involves eliminating non-essential elements in the critical modules of WEMoE and implementing shared routing across multiple MoE modules, thereby significantly reducing the trainable parameters, the overall parameter count, and the computational overhead of the model merged by WEMoE. Experimental results across various architectures and tasks demonstrate that both WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
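The contrast the abstract draws between static merging and input-dependent merging can be illustrated with a toy sketch. This is not the authors' implementation: the flat weight lists, the single-layer softmax router, and every function name below (`task_vector`, `static_merge`, `route`, `dynamic_merge`) are assumptions made for illustration only.

```python
import math

def task_vector(finetuned, pretrained):
    # Difference between fine-tuned and pretrained weights for one task.
    return [f - p for f, p in zip(finetuned, pretrained)]

def static_merge(pretrained, task_vectors, scale):
    # Task-arithmetic style merge: one fixed scale for every input.
    merged = list(pretrained)
    for tv in task_vectors:
        merged = [m + scale * t for m, t in zip(merged, tv)]
    return merged

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(features, router_weights):
    # Hypothetical router: one linear score per task, normalized by softmax.
    scores = [sum(w * x for w, x in zip(row, features)) for row in router_weights]
    return softmax(scores)

def dynamic_merge(pretrained, task_vectors, gate):
    # Input-dependent merge: each task vector is weighted by its gate value.
    merged = list(pretrained)
    for g, tv in zip(gate, task_vectors):
        merged = [m + g * t for m, t in zip(merged, tv)]
    return merged

# Toy weights for one "critical" module and two fine-tuned tasks (made-up numbers).
pretrained = [0.0, 1.0, 2.0]
finetuned_a = [1.0, 1.0, 2.0]
finetuned_b = [0.0, 1.0, 4.0]
task_vectors = [
    task_vector(finetuned_a, pretrained),
    task_vector(finetuned_b, pretrained),
]

# Non-critical modules would be merged once, independent of the input.
static_weights = static_merge(pretrained, task_vectors, scale=0.5)

# Critical modules would be re-merged per input, using the router's gate.
router_weights = [[1.0, 0.0], [0.0, 1.0]]
gate = route([2.0, 0.0], router_weights)
dynamic_weights = dynamic_merge(pretrained, task_vectors, gate)
```

In this sketch the input features favor the first task, so the gate puts more weight on the first task vector and the dynamically merged weights move closer to `finetuned_a` than the statically merged ones do. In the paper's setting the router would be trained and the merge would apply to whole Transformer modules rather than three scalars.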
Related papers
- Closed-form merging of parameter-efficient modules for Federated Continual Learning [9.940242741914748]
We introduce LoRM, an alternating optimization strategy that trains one LoRA matrix at a time.
This allows solving for each unknown variable individually, thus finding a unique solution.
Our method demonstrates state-of-the-art performance across a range of FCIL scenarios.
arXiv Detail & Related papers (2024-10-23T15:30:13Z)
- EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free, requiring neither access to data nor any additional training, while still showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z)
- Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z)
- Representation Surgery for Multi-Task Model Merging [57.63643005215592]
Multi-task learning (MTL) compresses the information from multiple tasks into a unified backbone to improve computational efficiency and generalization.
Recent work directly merges multiple independently trained models to perform MTL instead of collecting their raw data for joint training.
By visualizing the representation distribution of existing model merging schemes, we find that the merged model often suffers from the dilemma of representation bias.
arXiv Detail & Related papers (2024-02-05T03:39:39Z)
- Merging Multi-Task Models via Weight-Ensembling Mixture of Experts [64.94129594112557]
Merging Transformer-based models trained on different tasks yields a single unified model that can execute all the tasks concurrently.
Previous methods, exemplified by task arithmetic, have been proven to be both effective and scalable.
We propose to merge most of the parameters while upscaling the Transformer layers to a weight-ensembling mixture of experts (MoE) module.
arXiv Detail & Related papers (2024-02-01T08:58:57Z)
- BYOM: Building Your Own Multi-Task Model For Free [69.63765907216442]
BYOM-FFT is for merging fully finetuned models, while BYOM-LoRA is for LoRA-finetuned models.
Experiments on computer vision and natural language processing tasks show that the proposed BYOM methods outperform existing merging methods by a large margin.
arXiv Detail & Related papers (2023-10-03T08:39:33Z)
- MMSFormer: Multimodal Transformer for Material and Semantic Segmentation [16.17270247327955]
We propose a novel fusion strategy that can effectively fuse information from different modality combinations.
We also propose a new model named Multi-Modal TransFormer (MMSFormer) that incorporates the proposed fusion strategy.
MMSFormer outperforms current state-of-the-art models on three different datasets.
arXiv Detail & Related papers (2023-09-07T20:07:57Z)