Prismer: A Vision-Language Model with Multi-Task Experts
- URL: http://arxiv.org/abs/2303.02506v3
- Date: Thu, 18 Jan 2024 22:09:40 GMT
- Title: Prismer: A Vision-Language Model with Multi-Task Experts
- Authors: Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima
Anandkumar
- Abstract summary: Prismer is a data- and parameter-efficient vision-language model that leverages an ensemble of task-specific experts.
By leveraging experts from a wide range of domains, we show Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks.
In experiments, we show that Prismer achieves fine-tuned and few-shot learning performance which is competitive with current state-of-the-arts.
- Score: 119.82149763682156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent vision-language models have shown impressive multi-modal generation
capabilities. However, typically they require training huge models on massive
datasets. As a more scalable alternative, we introduce Prismer, a data- and
parameter-efficient vision-language model that leverages an ensemble of
task-specific experts. Prismer only requires training of a small number of
components, with the majority of network weights inherited from multiple
readily-available, pre-trained experts, and kept frozen during training. By
leveraging experts from a wide range of domains, we show Prismer can
efficiently pool this expert knowledge and adapt it to various vision-language
reasoning tasks. In our experiments, we show that Prismer achieves fine-tuned
and few-shot learning performance which is competitive with current
state-of-the-arts, whilst requiring up to two orders of magnitude less training
data. Code is available at https://github.com/NVlabs/prismer.
Related papers
- MoVA: Adapting Mixture of Vision Experts to Multimodal Context [38.8308841469793]
We propose the MoVA, a powerful and novel MLLM, adaptively routing and fusing task-specific vision experts with a coarse-to-fine mechanism.
In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts.
In the fine-grained stage, we elaborately conduct the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge.
arXiv Detail & Related papers (2024-04-19T17:59:48Z) - Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z) - An Efficient General-Purpose Modular Vision Model via Multi-Task
Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z) - MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning of these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training [29.240131406803794]
We show that a common space can be created without any training at all, using single-domain encoders and a much smaller amount of image-text pairs.
Our model has unique properties, most notably, deploying a new version with updated training samples can be done in a matter of seconds.
arXiv Detail & Related papers (2022-10-04T16:56:22Z) - Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.