UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
- URL: http://arxiv.org/abs/2307.16184v2
- Date: Fri, 22 Dec 2023 18:07:41 GMT
- Title: UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
- Authors: Mustafa Shukor, Corentin Dancette, Alexandre Rame, Matthieu Cord
- Abstract summary: UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation.
- Score: 105.77733287326308
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have brought the ambitious quest for generalist
agents far closer to reality. A key hurdle for building such
general models is the diversity and heterogeneity of tasks and modalities. A
promising solution is unification, allowing the support of a myriad of tasks
and modalities within one unified framework. While a few large models (e.g.,
Flamingo (Alayrac et al., 2022)), trained on massive datasets, can support more
than two modalities, current small- to mid-scale unified models are still
limited to two modalities, usually image-text or video-text. The question we
ask is: is it possible to efficiently build a unified model that can support
all modalities? To answer this, we propose UnIVAL, a step further towards this
ambitious goal. Without relying on fancy dataset sizes or models with billions
of parameters, the ~0.25B-parameter UnIVAL model goes beyond two modalities
and unifies text, images, video, and audio into a single model. Our model is
efficiently pretrained on many tasks, based on task balancing and multimodal
curriculum learning. UnIVAL shows competitive performance to existing
state-of-the-art approaches across image-text and video-text tasks. The feature
representations learned from the image-text and video-text modalities allow the model
to achieve competitive performance when finetuned on audio-text tasks, despite
not being pretrained on audio. Thanks to the unified model, we propose a novel
study on multimodal model merging via weight interpolation of models trained on
different multimodal tasks, showing their benefits in particular for
out-of-distribution generalization. Finally, we motivate unification by showing
the synergy between tasks. The model weights and code are released here:
https://github.com/mshukor/UnIVAL.
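To make the weight-interpolation study concrete, here is a minimal sketch of merging two finetuned checkpoints by linearly interpolating their parameters. It assumes PyTorch state dicts, hypothetical checkpoint names, and a hypothetical `build_model` helper; it illustrates the general technique, not the paper's actual code.

```python
# Minimal sketch of multimodal model merging by weight interpolation.
# Assumptions (not the paper's actual API): two finetuned checkpoints that
# share the UnIVAL architecture, and a hypothetical build_model() constructor.
import torch

def interpolate_weights(state_a, state_b, lam=0.5):
    """Return lam * state_a + (1 - lam) * state_b, key by key."""
    assert state_a.keys() == state_b.keys(), "checkpoints must share the same parameters"
    return {k: lam * state_a[k] + (1.0 - lam) * state_b[k] for k in state_a}

# Hypothetical checkpoints finetuned on different multimodal tasks,
# e.g. image captioning and video question answering.
ckpt_caption = torch.load("unival_caption.pt", map_location="cpu")
ckpt_videoqa = torch.load("unival_videoqa.pt", map_location="cpu")
merged = interpolate_weights(ckpt_caption, ckpt_videoqa, lam=0.5)

# model = build_model()          # hypothetical constructor for the shared architecture
# model.load_state_dict(merged)  # evaluate the merged model, e.g. on out-of-distribution data
```

Sweeping `lam` from 0 to 1 traces the interpolation path whose out-of-distribution behavior the paper studies; the coefficient and checkpoint names above are illustrative only.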
Related papers
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities [17.374241865041856]
We show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones and doing so without a loss in performance.
We successfully scale the training to a three billion parameter model using tens of modalities and different datasets.
The resulting models and training code are open sourced at 4m.epfl.ch.
arXiv Detail & Related papers (2024-06-13T17:59:42Z)
- MOWA: Multiple-in-One Image Warping Model [65.73060159073644]
We propose a Multiple-in-One image warping model (named MOWA) in this work.
We mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level.
To our knowledge, this is the first work that solves multiple practical warping tasks within a single model.
arXiv Detail & Related papers (2024-04-16T16:50:35Z)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
arXiv Detail & Related papers (2023-12-28T17:57:06Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even showing an emergent ability to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of the total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and captioning (see the sketch after this list).
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z)
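As referenced in the eP-ALM entry above, its summary describes a recipe concrete enough to sketch: freeze the pretrained language model and the perceptual encoder, and train only a linear projection plus one prepended soft token. The PyTorch sketch below is a hypothetical illustration under those assumptions; the module interfaces, dimensions, and names are placeholders, not the paper's implementation.

```python
# Hedged sketch of the eP-ALM-style recipe: freeze the language model and the
# visual encoder, train only one linear projection and one prepended soft token.
# Module names, shapes, and interfaces are illustrative placeholders.
import torch
import torch.nn as nn

class PerceptualAdapter(nn.Module):
    def __init__(self, language_model: nn.Module, visual_encoder: nn.Module,
                 vis_dim: int, lm_dim: int):
        super().__init__()
        self.lm = language_model
        self.visual_encoder = visual_encoder
        for p in self.lm.parameters():              # freeze the bulk of the parameters
            p.requires_grad = False
        for p in self.visual_encoder.parameters():  # freeze the perceptual encoder too
            p.requires_grad = False
        # The only trainable pieces: a linear projection into the LM embedding
        # space and a single learnable token prepended to every sequence.
        self.proj = nn.Linear(vis_dim, lm_dim)
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))

    def forward(self, image, text_embeds):
        # Assumes the frozen visual encoder returns features of shape (B, N, vis_dim)
        # and the frozen LM consumes a sequence of embeddings of shape (B, T, lm_dim).
        vis = self.proj(self.visual_encoder(image))
        tok = self.soft_token.expand(text_embeds.size(0), -1, -1)
        return self.lm(torch.cat([tok, vis, text_embeds], dim=1))
```

In this setup only `proj` and `soft_token` would be handed to the optimizer, which is how the trainable fraction stays well under 1% of the total parameter count.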