Foundation Model is Efficient Multimodal Multitask Model Selector
- URL: http://arxiv.org/abs/2308.06262v1
- Date: Fri, 11 Aug 2023 17:54:44 GMT
- Title: Foundation Model is Efficient Multimodal Multitask Model Selector
- Authors: Fanqing Meng, Wenqi Shao, Zhanglin Peng, Chonghe Jiang, Kaipeng Zhang,
Yu Qiao, Ping Luo
- Abstract summary: A brute-force approach is to fine-tune all models on all target datasets, which incurs high computational costs.
We propose an efficient multi-task model selector (EMMS) to transform diverse label formats into a unified noisy label embedding.
EMMS is fast, effective, and generic enough to assess the transferability of pre-trained models, making it the first model selection method in the multi-task scenario.
- Score: 47.017463595702274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates an under-explored but important problem: given a
collection of pre-trained neural networks, predicting their performance on
multi-modal tasks such as image recognition, referring, captioning, visual
question answering, and text question answering, without fine-tuning them.
A brute-force approach is to fine-tune all models on all target datasets,
which incurs high computational costs. Although recent approaches employ
lightweight metrics to measure models' transferability, they often depend
heavily on prior knowledge of a single task, making them inapplicable in a
multi-modal multi-task scenario. To tackle this issue, we propose an efficient
multi-task model selector (EMMS), which employs large-scale foundation models
to transform diverse label formats such as categories, texts, and bounding
boxes of different downstream tasks into a unified noisy label embedding. EMMS
can estimate a model's transferability through a simple weighted linear
regression, which can be efficiently solved by an alternating minimization
algorithm with a convergence guarantee. Extensive experiments on 5 downstream
tasks with 24 datasets show that EMMS is fast, effective, and generic enough to
assess the transferability of pre-trained models, making it the first model
selection method in the multi-task scenario. For instance, compared with the
state-of-the-art method LogME enhanced by our label embeddings, EMMS achieves
9.0%, 26.3%, 20.1%, 54.8%, 12.2% performance gain on image recognition,
referring, captioning, visual question answering, and text question answering,
while bringing 5.13x, 6.29x, 3.59x, 6.19x, and 5.66x speedup in wall-clock
time, respectively. The code is available at
https://github.com/OpenGVLab/Multitask-Model-Selector.
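As a rough illustration of the recipe described above, the toy Python sketch below mixes K precomputed label embeddings with weights t, fits a model's features X to that mixture by linear regression, and alternates closed-form updates of the regression weights and the simplex-projected mixture weights. The shapes, the simplex constraint on t, and the use of the negative residual as a transferability score are illustrative assumptions, not the authors' exact EMMS objective; see the repository above for the real implementation.
```python
# Hypothetical sketch of weighted linear regression solved by alternating
# minimization, in the spirit of the abstract; not the authors' EMMS code.
import numpy as np

def project_to_simplex(v):
    """Euclidean projection onto {t : t >= 0, sum(t) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def transferability_score(X, Z, n_iters=50):
    """
    X: (N, D) features of the target data extracted by one pre-trained model.
    Z: (N, L, K) label embeddings of dimension L from K foundation models
       (in practice, e.g., category names, captions, or boxes run through
       text/image encoders; here any precomputed arrays will do).
    Alternately minimizes ||X w - Z t||_F^2 over w and t.
    """
    N, D = X.shape
    _, L, K = Z.shape
    t = np.full(K, 1.0 / K)                          # start from a uniform mixture
    for _ in range(n_iters):
        Y = Z @ t                                    # (N, L) mixed label target
        w, *_ = np.linalg.lstsq(X, Y, rcond=None)    # closed-form w given t
        r = (X @ w).reshape(-1)                      # flattened predictions
        M = Z.reshape(N * L, K)                      # one column per label source
        t, *_ = np.linalg.lstsq(M, r, rcond=None)    # least-squares t given w
        t = project_to_simplex(t)                    # keep t a convex combination
    residual = np.linalg.norm(X @ w - Z @ t) ** 2
    return -residual                                 # higher = better predicted transfer

# Toy usage with random stand-ins for features and label embeddings:
rng = np.random.default_rng(0)
print(transferability_score(rng.normal(size=(100, 32)), rng.normal(size=(100, 16, 3))))
```
Ranking a zoo of pre-trained models would then amount to computing such a score once per model from cached features, instead of fine-tuning each candidate.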
Related papers
- Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z)
- All Birds with One Stone: Multi-task Text Classification for Efficient Inference with One Forward Pass [34.85886030306857]
In web content classification, multiple classification tasks are predicted from the same input text, such as a web article.
Existing multitask transformer models need to conduct N forward passes for N tasks with O(N) cost.
We propose a scalable method that can achieve stronger performance with close to O(1) computation cost via only one forward pass (see the sketch after this list).
arXiv Detail & Related papers (2022-05-22T05:16:03Z)
- Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data [5.689320790746046]
Multi-Task Learning (MTL) networks have emerged as a promising method for transferring learned knowledge across different tasks.
However, MTL must deal with challenges such as: overfitting to low resource tasks, catastrophic forgetting, and negative task transfer.
We propose a novel Transformer architecture consisting of a new conditional attention mechanism and a set of task-conditioned modules.
arXiv Detail & Related papers (2020-09-19T02:04:34Z)
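Below is the sketch referenced in the All Birds with One Stone entry above: a generic shared-encoder model with one lightweight head per task, so all N task predictions come from a single forward pass. The architecture, the class name SharedEncoderMultiHead, and all hyperparameters are assumptions for illustration, not that paper's actual method.
```python
# Generic "one forward pass, N task heads" pattern; illustrative only.
import torch
import torch.nn as nn

class SharedEncoderMultiHead(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, n_layers=2,
                 task_num_classes=(3, 5, 2)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # one lightweight linear head per task; the heavy encoder is shared
        self.heads = nn.ModuleList([nn.Linear(d_model, c) for c in task_num_classes])

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))    # single shared forward pass
        pooled = h.mean(dim=1)                     # mean-pool over token positions
        # every task head reuses the same pooled representation
        return [head(pooled) for head in self.heads]

# Toy usage: one forward pass yields logits for all three hypothetical tasks.
model = SharedEncoderMultiHead()
tokens = torch.randint(0, 30522, (4, 16))          # batch of 4 dummy token sequences
logits_per_task = model(tokens)
print([tuple(t.shape) for t in logits_per_task])   # [(4, 3), (4, 5), (4, 2)]
```
Because the encoder dominates the cost, adding more tasks only adds cheap linear heads, which is the intuition behind the near-O(1) claim in that summary.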