Merging Vision Transformers from Different Tasks and Domains
- URL: http://arxiv.org/abs/2312.16240v1
- Date: Mon, 25 Dec 2023 09:32:28 GMT
- Title: Merging Vision Transformers from Different Tasks and Domains
- Authors: Peng Ye, Chenyu Huang, Mingzhu Shen, Tao Chen, Yongqi Huang, Yuning
Zhang, Wanli Ouyang
- Abstract summary: This work aims to merge various Vision Transformers (ViTs) trained on different tasks (i.e., datasets with different object categories) or domains (i.e., datasets with the same categories but different environments) into one unified model.
Previous model merging works focus on either CNNs or NLP models, leaving ViT merging unexplored.
- Score: 46.40701388197936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work aims to merge various Vision Transformers (ViTs) trained on
different tasks (i.e., datasets with different object categories) or domains
(i.e., datasets with the same categories but different environments) into one
unified model that still performs well on each task or domain. Previous model
merging works focus on either CNNs or NLP models, leaving ViT merging
unexplored. To fill this gap, we first find that existing model merging methods
cannot handle merging entire ViT models well and leave room for improvement. To
enable merging of the whole ViT, we propose a simple but effective gating
network that can both merge all kinds of layers (e.g., Embedding, Norm,
Attention, and MLP) and select the suitable classifier. Specifically, the
gating network is trained on unlabeled data from all tasks (domains) and
predicts the probability that an input belongs to each task (domain); this
probability is used to merge the models during inference. To further boost the
performance of the merged model, especially as the difficulty of the merging
tasks increases, we design a novel model weight similarity metric and use it to
realize controllable and combined weight merging. Comprehensive experiments on
a range of newly established benchmarks validate the superiority of the
proposed ViT merging framework across different tasks and domains. Our method
can merge more than 10 ViT models from different vision tasks with a negligible
effect on the performance of each task.
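The gating mechanism described in the abstract can be pictured with a short sketch: a lightweight network, trained only with the dataset index of each unlabeled image as its supervision signal, outputs per-task probabilities that serve as coefficients for merging every ViT layer and for selecting the classifier head. The sketch below is one plausible reading of that description, not the authors' released implementation; the names (`TaskGate`, `merge_state_dicts`, `predict`) and the gate architecture are assumptions.

```python
# Minimal sketch of gating-based ViT merging (hypothetical names and
# architecture; not the paper's official code).
import copy

import torch
import torch.nn as nn


class TaskGate(nn.Module):
    """Tiny gating network that predicts which task/domain an image comes from.

    It can be trained on unlabeled images from every task, since the index of
    the dataset an image was drawn from serves as a free supervision signal.
    """

    def __init__(self, num_tasks: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(32, num_tasks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, num_tasks) probabilities over tasks/domains
        return self.head(self.features(x)).softmax(dim=-1)


@torch.no_grad()
def merge_state_dicts(state_dicts: list, coeffs: list) -> dict:
    """Merge every layer (Embedding, Norm, Attention, MLP, ...) of the
    task-specific ViTs as a weighted average with the given coefficients."""
    merged = {}
    for name, ref in state_dicts[0].items():
        if ref.is_floating_point():
            merged[name] = sum(c * sd[name] for c, sd in zip(coeffs, state_dicts))
        else:
            merged[name] = ref.clone()  # e.g. integer buffers are copied as-is
    return merged


@torch.no_grad()
def predict(x: torch.Tensor, gate: TaskGate, backbones: list, classifiers: list):
    """Merge the backbones for this batch, then route to the classifier head of
    the most likely task/domain, as described in the abstract."""
    probs = gate(x).mean(dim=0)        # batch-level task probabilities
    merged = copy.deepcopy(backbones[0])
    merged.load_state_dict(
        merge_state_dicts([b.state_dict() for b in backbones], probs.tolist())
    )
    task_idx = int(probs.argmax())     # "select the suitable classifier"
    return classifiers[task_idx](merged(x))
```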
Related papers
- Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging [21.918559935122786]
Model merging is a promising way to combine multiple task-specific models into a single multitask model without extra training.
Traditional model merging methods often show significant performance gaps compared to fine-tuned models.
We show that both shared and exclusive task-specific knowledge are crucial for merging performance, but directly merging exclusive knowledge hinders overall performance.
We propose Twin-Merging, a method that encompasses two principal stages: (1) modularizing knowledge into shared and exclusive components, with compression to reduce redundancy and enhance efficiency; (2) dynamically merging shared and task-specific knowledge based on the input.
arXiv Detail & Related papers (2024-06-17T02:31:55Z)
- Training-Free Pretrained Model Merging [38.16269074353077]
We propose an innovative model merging framework, coined merging under dual-space constraints (MuDSC).
In order to enhance usability, we have also incorporated adaptations for group structure, including Multi-Head Attention and Group Normalization.
arXiv Detail & Related papers (2024-03-04T06:19:27Z)
- AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging).
It aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance.
arXiv Detail & Related papers (2023-10-04T04:26:33Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
arXiv Detail & Related papers (2023-04-28T15:43:21Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
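Both the main abstract's weight similarity metric and the weight-distance metrics in "An Empirical Study of Multimodal Model Merging" compare checkpoints in weight space before merging. The sketch below illustrates one plausible instantiation, cosine similarity over flattened parameters used as a merging coefficient; the function names and the coefficient mapping are assumptions, not the papers' exact formulations.

```python
# Illustrative weight-similarity metric and similarity-driven interpolation
# (assumed formulation; the papers' metrics may differ).
import torch


def weight_similarity(sd_a: dict, sd_b: dict) -> float:
    """Cosine similarity between the concatenated float parameters of two models."""
    vec_a = torch.cat([t.flatten() for t in sd_a.values() if t.is_floating_point()])
    vec_b = torch.cat([t.flatten() for t in sd_b.values() if t.is_floating_point()])
    return torch.nn.functional.cosine_similarity(vec_a, vec_b, dim=0).item()


def interpolate(sd_a: dict, sd_b: dict, alpha: float) -> dict:
    """Weighted average of two state dicts: alpha * A + (1 - alpha) * B."""
    return {
        k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k]
        if sd_a[k].is_floating_point() else sd_a[k].clone()
        for k in sd_a
    }


# Hypothetical usage: a higher weight similarity suggests the two models can be
# merged more evenly, while a lower similarity suggests staying closer to one
# endpoint -- one simple way to make the merge "controllable".
# sim = weight_similarity(vit_a.state_dict(), vit_b.state_dict())
# alpha = 1.0 - 0.5 * max(sim, 0.0)   # even merge when similar, keep model A when not
# merged = interpolate(vit_a.state_dict(), vit_b.state_dict(), alpha=alpha)
```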