OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models
- URL: http://arxiv.org/abs/2212.04408v1
- Date: Thu, 8 Dec 2022 17:07:09 GMT
- Title: OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models
- Authors: Jinze Bai, Rui Men, Hao Yang, Xuancheng Ren, Kai Dang, Yichang Zhang,
Xiaohuan Zhou, Peng Wang, Sinan Tan, An Yang, Zeyu Cui, Yu Han, Shuai Bai,
Wenbin Ge, Jianxin Ma, Junyang Lin, Jingren Zhou, Chang Zhou
- Abstract summary: Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
- Score: 72.8156832931841
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalist models, which are capable of performing diverse multi-modal tasks
in a task-agnostic way within a single model, have been explored recently.
Although they offer a promising route toward general-purpose AI, existing
generalist models are still at an early stage, with limited modality and task
coverage. To empower multi-modal task scaling and speed up this line of
research, we release a generalist model learning system, OFASys, built on top
of a declarative task interface named multi-modal instruction. At the core of
OFASys is the idea of decoupling multi-modal task representations from the
underlying model implementations. In OFASys, a task involving multiple
modalities can be defined declaratively even with just a single line of code.
The system automatically generates task plans from such instructions for
training and inference. It also facilitates multi-task training for diverse
multi-modal workloads. As a starting point, we provide presets of 7 different
modalities and 23 highly diverse example tasks in OFASys, with which we also
develop a first-of-its-kind single model, OFA+, that can handle text, image,
speech, video, and motion data. On average, the single OFA+ model achieves 95%
of the performance of 15 task-finetuned models while using only 16% of their
parameters, showcasing the reliable performance of the multi-modal task scaling
provided by OFASys.
Available at https://github.com/OFA-Sys/OFASys
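To make the declarative interface concrete, here is a minimal, self-contained sketch of how a single-line multi-modal instruction could be parsed into a task plan. The slot syntax ([IMAGE:img] ... -> [TEXT:cap]) follows the style of the instruction examples in the paper, but everything else below (Slot, TaskPlan, parse_instruction) is a hypothetical illustration written for this summary, not the actual OFASys API; see the linked repository for the real interface.

```python
# Illustrative sketch only. The slot syntax mirrors the multi-modal
# instruction examples in the OFASys paper, but Slot, TaskPlan, and
# parse_instruction are hypothetical names, not the OFASys API.
import re
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Slot:
    modality: str    # e.g. "IMAGE", "TEXT", "AUDIO", "VIDEO", "MOTION"
    name: str        # data field this slot binds to (e.g. a dataset column)
    is_target: bool  # True if the slot sits on the decoder (target) side

@dataclass
class TaskPlan:
    inputs: List[Union[str, Slot]]   # encoder-side prompt text and slots
    targets: List[Union[str, Slot]]  # decoder-side text and slots

SLOT_PATTERN = re.compile(r"\[([A-Z_]+):([A-Za-z0-9_]+)\]")

def _parse_side(segment: str, is_target: bool) -> List[Union[str, Slot]]:
    """Split one side of an instruction into plain-text pieces and typed slots."""
    parts: List[Union[str, Slot]] = []
    last = 0
    for m in SLOT_PATTERN.finditer(segment):
        text = segment[last:m.start()].strip()
        if text:
            parts.append(text)
        parts.append(Slot(m.group(1), m.group(2), is_target))
        last = m.end()
    tail = segment[last:].strip()
    if tail:
        parts.append(tail)
    return parts

def parse_instruction(instruction: str) -> TaskPlan:
    """Turn a declarative 'source -> target' instruction into a task plan."""
    source, _, target = instruction.partition("->")
    return TaskPlan(_parse_side(source, False), _parse_side(target, True))

# An image-captioning-style task declared in one line, in the spirit of the paper.
plan = parse_instruction("[IMAGE:img] what does the image describe? -> [TEXT:cap]")
print(plan.inputs)   # an IMAGE slot followed by the prompt text
print(plan.targets)  # a TEXT slot for the caption to be generated
```

The point of the sketch is the decoupling the abstract describes: the instruction only names modalities and the data fields they bind to, while the choice of encoders, decoders, and the training or inference plan is left to the system.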
Related papers
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Task Preference Optimization (TPO) is a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks.
By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance.
Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models.
arXiv Detail & Related papers (2024-12-26T18:56:05Z)
- One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning [16.96824902454355]
We propose a unified framework that concurrently handles multiple tasks and modalities.
In this framework, all modalities and tasks are represented as unified tokens and trained using a single, consistent approach.
We present a new benchmark, MMUD, which includes samples annotated with multiple task labels.
We demonstrate the ability to handle multiple tasks simultaneously in a streamlined and efficient manner.
arXiv Detail & Related papers (2024-08-06T07:19:51Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Unified-modal Salient Object Detection via Adaptive Prompt Learning [18.90181500147265]
We propose a unified framework called UniSOD to address both single-modal and multi-modal SOD tasks.
UniSOD learns modality-aware prompts with task-specific hints through adaptive prompt learning.
Our method achieves overall performance improvement on 14 benchmark datasets for RGB, RGB-D, and RGB-T SOD.
arXiv Detail & Related papers (2023-11-28T14:51:08Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
- Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts [75.75548749888029]
We present a vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks.
With a single model, Musketeer achieves results comparable to or better than strong baselines trained on single tasks, almost uniformly across multiple tasks.
arXiv Detail & Related papers (2023-05-11T17:57:49Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)