OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models
- URL: http://arxiv.org/abs/2212.04408v1
- Date: Thu, 8 Dec 2022 17:07:09 GMT
- Title: OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models
- Authors: Jinze Bai, Rui Men, Hao Yang, Xuancheng Ren, Kai Dang, Yichang Zhang,
Xiaohuan Zhou, Peng Wang, Sinan Tan, An Yang, Zeyu Cui, Yu Han, Shuai Bai,
Wenbin Ge, Jianxin Ma, Junyang Lin, Jingren Zhou, Chang Zhou
- Abstract summary: Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
- Score: 72.8156832931841
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalist models, which are capable of performing diverse multi-modal tasks
in a task-agnostic way within a single model, have been explored recently.
Although they offer a promising route toward general-purpose AI, existing
generalist models are still at an early stage, with limited modality and task
coverage. To empower multi-modal task scaling and speed up this line of
research, we release a generalist model learning system, OFASys, built on top
of a declarative task interface named multi-modal instruction. At the core of
OFASys is the idea of decoupling multi-modal task representations from the
underlying model implementations. In OFASys, a task involving multiple
modalities can be defined declaratively even with just a single line of code.
The system automatically generates task plans from such instructions for
training and inference. It also facilitates multi-task training for diverse
multi-modal workloads. As a starting point, we provide presets of 7 different
modalities and 23 highly diverse example tasks in OFASys, with which we also
develop a first-of-its-kind single model, OFA+, that can handle text, image,
speech, video, and motion data. On average, the single OFA+ model achieves 95%
of the performance of 15 task-finetuned models while using only 16% of their
parameters, showcasing the reliable performance of the multi-modal task scaling
provided by OFASys.
Available at https://github.com/OFA-Sys/OFASys
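To make the declarative interface concrete, here is a minimal, self-contained sketch of how a single-line multi-modal instruction could be parsed into a task plan. The slot syntax ([IMAGE:img] ... -> [TEXT:cap]) follows the style of the instruction examples in the paper, but everything else below (Slot, TaskPlan, parse_instruction) is a hypothetical illustration written for this summary, not the actual OFASys API; see the linked repository for the real interface.

```python
# Illustrative sketch only. The slot syntax mirrors the multi-modal
# instruction examples in the OFASys paper, but Slot, TaskPlan, and
# parse_instruction are hypothetical names, not the OFASys API.
import re
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Slot:
    modality: str    # e.g. "IMAGE", "TEXT", "AUDIO", "VIDEO", "MOTION"
    name: str        # data field this slot binds to (e.g. a dataset column)
    is_target: bool  # True if the slot sits on the decoder (target) side

@dataclass
class TaskPlan:
    inputs: List[Union[str, Slot]]   # encoder-side prompt text and slots
    targets: List[Union[str, Slot]]  # decoder-side text and slots

SLOT_PATTERN = re.compile(r"\[([A-Z_]+):([A-Za-z0-9_]+)\]")

def _parse_side(segment: str, is_target: bool) -> List[Union[str, Slot]]:
    """Split one side of an instruction into plain-text pieces and typed slots."""
    parts: List[Union[str, Slot]] = []
    last = 0
    for m in SLOT_PATTERN.finditer(segment):
        text = segment[last:m.start()].strip()
        if text:
            parts.append(text)
        parts.append(Slot(m.group(1), m.group(2), is_target))
        last = m.end()
    tail = segment[last:].strip()
    if tail:
        parts.append(tail)
    return parts

def parse_instruction(instruction: str) -> TaskPlan:
    """Turn a declarative 'source -> target' instruction into a task plan."""
    source, _, target = instruction.partition("->")
    return TaskPlan(_parse_side(source, False), _parse_side(target, True))

# An image-captioning-style task declared in one line, in the spirit of the paper.
plan = parse_instruction("[IMAGE:img] what does the image describe? -> [TEXT:cap]")
print(plan.inputs)   # an IMAGE slot followed by the prompt text
print(plan.targets)  # a TEXT slot for the caption to be generated
```

The point of the sketch is the decoupling the abstract describes: the instruction only names modalities and the data fields they bind to, while the choice of encoders, decoders, and the training or inference plan is left to the system.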
Related papers
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Task Preference Optimization (TPO) is a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks.
By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance.
Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models.
arXiv Detail & Related papers (2024-12-26T18:56:05Z)
- One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning [16.96824902454355]
We propose a unified framework that concurrently handles multiple tasks and modalities.
In this framework, all modalities and tasks are represented as unified tokens and trained using a single, consistent approach.
We present a new benchmark, MMUD, which includes samples annotated with multiple task labels.
We demonstrate the ability to handle multiple tasks simultaneously in a streamlined and efficient manner.
arXiv Detail & Related papers (2024-08-06T07:19:51Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Unified-modal Salient Object Detection via Adaptive Prompt Learning [18.90181500147265]
We propose a unified framework called UniSOD to address both single-modal and multi-modal SOD tasks.
UniSOD learns modality-aware prompts with task-specific hints through adaptive prompt learning.
Our method achieves overall performance improvement on 14 benchmark datasets for RGB, RGB-D, and RGB-T SOD.
arXiv Detail & Related papers (2023-11-28T14:51:08Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
- Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts [75.75548749888029]
We present a vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks.
With a single model, Musketeer achieves results comparable to or better than strong baselines trained on single tasks, almost uniformly across multiple tasks.
arXiv Detail & Related papers (2023-05-11T17:57:49Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)