Foundation Model is Efficient Multimodal Multitask Model Selector
- URL: http://arxiv.org/abs/2308.06262v1
- Date: Fri, 11 Aug 2023 17:54:44 GMT
- Title: Foundation Model is Efficient Multimodal Multitask Model Selector
- Authors: Fanqing Meng, Wenqi Shao, Zhanglin Peng, Chonghe Jiang, Kaipeng Zhang,
Yu Qiao, Ping Luo
- Abstract summary: A brute-force approach is to fine-tune all models on all target datasets, which incurs high computational costs.
We propose an efficient multi-task model selector (EMMS) to transform diverse label formats into a unified noisy label embedding.
EMMS is fast, effective, and generic enough to assess the transferability of pre-trained models, making it the first model selection method in the multi-task scenario.
- Score: 47.017463595702274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates an under-explored but important problem: given a
collection of pre-trained neural networks, predicting their performance on
multi-modal tasks such as image recognition, referring, captioning, visual
question answering, and text question answering, without fine-tuning them.
A brute-force approach is to fine-tune all models on all target datasets,
which incurs high computational costs. Although recent approaches employ
lightweight metrics to measure models' transferability, they often depend
heavily on prior knowledge of a single task, making them inapplicable in a
multi-modal multi-task scenario. To tackle this issue, we propose an efficient
multi-task model selector (EMMS), which employs large-scale foundation models
to transform diverse label formats such as categories, texts, and bounding
boxes of different downstream tasks into a unified noisy label embedding. EMMS
can estimate a model's transferability through a simple weighted linear
regression, which can be efficiently solved by an alternating minimization
algorithm with a convergence guarantee. Extensive experiments on 5 downstream
tasks with 24 datasets show that EMMS is fast, effective, and generic enough to
assess the transferability of pre-trained models, making it the first model
selection method in the multi-task scenario. For instance, compared with the
state-of-the-art method LogME enhanced by our label embeddings, EMMS achieves
9.0%, 26.3%, 20.1%, 54.8%, 12.2% performance gain on image recognition,
referring, captioning, visual question answering, and text question answering,
while bringing 5.13x, 6.29x, 3.59x, 6.19x, and 5.66x speedup in wall-clock
time, respectively. The code is available at
https://github.com/OpenGVLab/Multitask-Model-Selector.
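As a rough illustration of the recipe described above, the toy Python sketch below mixes K precomputed label embeddings with weights t, fits a model's features X to that mixture by linear regression, and alternates closed-form updates of the regression weights and the simplex-projected mixture weights. The shapes, the simplex constraint on t, and the use of the negative residual as a transferability score are illustrative assumptions, not the authors' exact EMMS objective; see the repository above for the real implementation.
```python
# Hypothetical sketch of weighted linear regression solved by alternating
# minimization, in the spirit of the abstract; not the authors' EMMS code.
import numpy as np

def project_to_simplex(v):
    """Euclidean projection onto {t : t >= 0, sum(t) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def transferability_score(X, Z, n_iters=50):
    """
    X: (N, D) features of the target data extracted by one pre-trained model.
    Z: (N, L, K) label embeddings of dimension L from K foundation models
       (in practice, e.g., category names, captions, or boxes run through
       text/image encoders; here any precomputed arrays will do).
    Alternately minimizes ||X w - Z t||_F^2 over w and t.
    """
    N, D = X.shape
    _, L, K = Z.shape
    t = np.full(K, 1.0 / K)                          # start from a uniform mixture
    for _ in range(n_iters):
        Y = Z @ t                                    # (N, L) mixed label target
        w, *_ = np.linalg.lstsq(X, Y, rcond=None)    # closed-form w given t
        r = (X @ w).reshape(-1)                      # flattened predictions
        M = Z.reshape(N * L, K)                      # one column per label source
        t, *_ = np.linalg.lstsq(M, r, rcond=None)    # least-squares t given w
        t = project_to_simplex(t)                    # keep t a convex combination
    residual = np.linalg.norm(X @ w - Z @ t) ** 2
    return -residual                                 # higher = better predicted transfer

# Toy usage with random stand-ins for features and label embeddings:
rng = np.random.default_rng(0)
print(transferability_score(rng.normal(size=(100, 32)), rng.normal(size=(100, 16, 3))))
```
Ranking a zoo of pre-trained models would then amount to computing such a score once per model from cached features, instead of fine-tuning each candidate.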
Related papers
- Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z)
- All Birds with One Stone: Multi-task Text Classification for Efficient Inference with One Forward Pass [34.85886030306857]
In web content classification, multiple classification tasks are predicted from the same input text, such as a web article.
Existing multitask transformer models need to conduct N forward passes for N tasks with O(N) cost.
We propose a scalable method that can achieve stronger performance with close to O(1) computation cost via only one forward pass (see the sketch after this list).
arXiv Detail & Related papers (2022-05-22T05:16:03Z)
- Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data [5.689320790746046]
Multi-Task Learning (MTL) networks have emerged as a promising method for transferring learned knowledge across different tasks.
However, MTL must deal with challenges such as: overfitting to low resource tasks, catastrophic forgetting, and negative task transfer.
We propose a novel Transformer architecture consisting of a new conditional attention mechanism and a set of task-conditioned modules.
arXiv Detail & Related papers (2020-09-19T02:04:34Z)
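Below is the sketch referenced in the All Birds with One Stone entry above: a generic shared-encoder model with one lightweight head per task, so all N task predictions come from a single forward pass. The architecture, the class name SharedEncoderMultiHead, and all hyperparameters are assumptions for illustration, not that paper's actual method.
```python
# Generic "one forward pass, N task heads" pattern; illustrative only.
import torch
import torch.nn as nn

class SharedEncoderMultiHead(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, n_layers=2,
                 task_num_classes=(3, 5, 2)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # one lightweight linear head per task; the heavy encoder is shared
        self.heads = nn.ModuleList([nn.Linear(d_model, c) for c in task_num_classes])

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))    # single shared forward pass
        pooled = h.mean(dim=1)                     # mean-pool over token positions
        # every task head reuses the same pooled representation
        return [head(pooled) for head in self.heads]

# Toy usage: one forward pass yields logits for all three hypothetical tasks.
model = SharedEncoderMultiHead()
tokens = torch.randint(0, 30522, (4, 16))          # batch of 4 dummy token sequences
logits_per_task = model(tokens)
print([tuple(t.shape) for t in logits_per_task])   # [(4, 3), (4, 5), (4, 2)]
```
Because the encoder dominates the cost, adding more tasks only adds cheap linear heads, which is the intuition behind the near-O(1) claim in that summary.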