OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist
Models
- URL: http://arxiv.org/abs/2212.04408v1
- Date: Thu, 8 Dec 2022 17:07:09 GMT
- Title: OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist
Models
- Authors: Jinze Bai, Rui Men, Hao Yang, Xuancheng Ren, Kai Dang, Yichang Zhang,
Xiaohuan Zhou, Peng Wang, Sinan Tan, An Yang, Zeyu Cui, Yu Han, Shuai Bai,
Wenbin Ge, Jianxin Ma, Junyang Lin, Jingren Zhou, Chang Zhou
- Abstract summary: Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
- Score: 72.8156832931841
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalist models, which are capable of performing diverse multi-modal tasks
in a task-agnostic way within a single model, have been explored recently.
Although they offer a hopeful route toward general-purpose AI, existing
generalist models are still at an early stage, with limited modality and task
coverage. To empower multi-modal task scaling and speed up this line of
research, we release a generalist model learning system, OFASys, built on top
of a declarative task interface named multi-modal instruction. At the core of
OFASys is the idea of decoupling multi-modal task representations from the
underlying model implementations. In OFASys, a task involving multiple
modalities can be defined declaratively even with just a single line of code.
The system automatically generates task plans from such instructions for
training and inference. It also facilitates multi-task training for diverse
multi-modal workloads. As a starting point, we provide presets of 7 different
modalities and 23 highly diverse example tasks in OFASys, with which we also
develop a first-of-its-kind single model, OFA+, that can handle text, image,
speech, video, and motion data. On average, the single OFA+ model achieves 95%
of the performance of 15 task-finetuned models with only 16% of their
parameters, showcasing the performance reliability of multi-modal task scaling
provided by OFASys.
Available at https://github.com/OFA-Sys/OFASys
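To make the declarative multi-modal instruction concrete, the sketch below shows how a single instruction line can be parsed into the pieces a system would need to derive a task plan. The [MODALITY:field] slot syntax and the "->" input/target separator follow the style of the examples in the OFASys repository; the parser itself is purely illustrative and is not the OFASys implementation.
```python
# Illustrative sketch only: parse an OFASys-style multi-modal instruction into
# typed slots, the information a system needs to derive a task plan. The
# [MODALITY:field] slot syntax and the "->" input/target separator follow the
# examples in the OFASys repository; this parser is NOT the real implementation.
import re
from dataclasses import dataclass

# Matches slots such as [IMAGE:image] or [TEXT:caption,mask_ratio=0.3],
# capturing the modality name and the bound data field.
SLOT_PATTERN = re.compile(r"\[([A-Z]+):([^\],]+)")

@dataclass
class Slot:
    modality: str    # e.g. "IMAGE", "TEXT", "AUDIO"
    field: str       # data column bound to this slot
    is_target: bool  # True if the slot sits on the target side of "->"

def parse_instruction(instruction: str) -> list[Slot]:
    """Split a one-line declarative instruction into input and target slots."""
    source, _, target = instruction.partition("->")
    slots = [Slot(m, f, False) for m, f in SLOT_PATTERN.findall(source)]
    slots += [Slot(m, f, True) for m, f in SLOT_PATTERN.findall(target)]
    return slots

# An image-captioning task declared in a single line, in the OFASys style.
caption = "[IMAGE:image] what does the image describe? -> [TEXT:caption]"
for slot in parse_instruction(caption):
    side = "target" if slot.is_target else "input"
    print(f"{side}: modality={slot.modality}, field={slot.field}")
# input: modality=IMAGE, field=image
# target: modality=TEXT, field=caption
```
In OFASys itself, such one-line declarations are handed to the system, which automatically generates task plans for training and inference and supports multi-task training over a list of such declarations.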
Related papers
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Unified-modal Salient Object Detection via Adaptive Prompt Learning [18.90181500147265]
We propose a unified framework called UniSOD to address both single-modal and multi-modal SOD tasks.
UniSOD learns modality-aware prompts with task-specific hints through adaptive prompt learning.
Our method achieves overall performance improvement on 14 benchmark datasets for RGB, RGB-D, and RGB-T SOD.
arXiv Detail & Related papers (2023-11-28T14:51:08Z)
- Towards Robust Multi-Modal Reasoning via Model Selection [7.6621866737827045]
The LLM serves as the "brain" of the agent, orchestrating multiple tools for collaborative multi-step task solving.
We propose the $\textit{M3}$ framework as a plug-in with negligible runtime overhead at test time.
Our experiments reveal that our framework enables dynamic model selection, considering both user inputs and subtask dependencies.
arXiv Detail & Related papers (2023-10-12T16:06:18Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
- Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts [75.75548749888029]
We present a vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks.
With a single model, Musketeer achieves results comparable to or better than strong baselines trained on single tasks, almost uniformly across multiple tasks.
arXiv Detail & Related papers (2023-05-11T17:57:49Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- Task Adaptive Parameter Sharing for Multi-Task Learning [114.80350786535952]
Task Adaptive Parameter Sharing (TAPS) is a method for tuning a base model to a new task by adaptively modifying a small, task-specific subset of layers.
Compared to other methods, TAPS retains high accuracy on downstream tasks while introducing few task-specific parameters.
We evaluate our method on a suite of fine-tuning tasks and architectures (ResNet, DenseNet, ViT) and show that it achieves state-of-the-art performance while being simple to implement.
arXiv Detail & Related papers (2022-03-30T23:16:07Z)