One Model, Multiple Modalities: A Sparsely Activated Approach for Text,
Sound, Image, Video and Code
- URL: http://arxiv.org/abs/2205.06126v1
- Date: Thu, 12 May 2022 14:39:21 GMT
- Title: One Model, Multiple Modalities: A Sparsely Activated Approach for Text,
Sound, Image, Video and Code
- Authors: Yong Dai, Duyu Tang, Liangxin Liu, Minghuan Tan, Cong Zhou, Jingquan
Wang, Zhangyin Feng, Fan Zhang, Xueyu Hu, Shuming Shi
- Abstract summary: This paper presents an approach that excels at handling multiple modalities of information with a single model.
We develop our model for five modalities including text, image, sound, video and code.
Our model supports self-supervised pretraining in the same sparsely activated manner, resulting in better initialized parameters for different modalities.
- Score: 26.40920402395547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: People perceive the world with multiple senses (e.g., through hearing sounds,
reading words and seeing objects). However, most existing AI systems only
process an individual modality. This paper presents an approach that excels at
handling multiple modalities of information with a single model. In our
"{SkillNet}" model, different parts of the parameters are specialized for
processing different modalities. Unlike traditional dense models that always
activate all the model parameters, our model sparsely activates parts of the
parameters whose skills are relevant to the task. Such model design enables
SkillNet to learn skills in a more interpretable way. We develop our model for
five modalities including text, image, sound, video and code. Results show
that SkillNet performs comparably to five modality-specific fine-tuned models.
Moreover, our model supports self-supervised pretraining in the same sparsely
activated manner, resulting in better initialized parameters for different
modalities. We find that pretraining significantly improves the performance of
SkillNet on five modalities, on par with or even better than baselines with
modality-specific pretraining. On the task of Chinese text-to-image retrieval,
our final system achieves higher accuracy than existing leading systems
including Wukong (ViT-B) and Wenlan 2.0 while using fewer activated
parameters.
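To make the sparse-activation design concrete, below is a minimal sketch of
modality-based routing in a Transformer feed-forward block; the module,
dimensions and skill names are illustrative assumptions, not the authors'
implementation.

```python
import torch
import torch.nn as nn

class SkillFFN(nn.Module):
    """Feed-forward block with one expert branch per modality ("skill").

    Only the branch matching the input's modality runs on a given forward
    pass, so the parameters of the other modalities stay inactive, which
    is the sparse activation described in the abstract.
    """

    def __init__(self, d_model, d_hidden,
                 modalities=("text", "image", "sound", "video", "code")):
        super().__init__()
        self.experts = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d_model, d_hidden),
                             nn.GELU(),
                             nn.Linear(d_hidden, d_model))
            for m in modalities
        })

    def forward(self, x, modality):
        # Route through the single skill relevant to this input.
        return self.experts[modality](x)

ffn = SkillFFN(d_model=768, d_hidden=3072)
tokens = torch.randn(2, 16, 768)       # (batch, sequence, hidden)
out = ffn(tokens, modality="image")    # only the image expert is activated
```

A cross-modal task such as text-to-image retrieval would activate two such
skills, one per input stream.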
Related papers
- Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small Language Model [0.0]
We present a novel 4.5B parameter small language model that can handle multiple input and output modalities.
Despite its small size, the model achieves near state-of-the-art performance on a variety of tasks.
arXiv Detail & Related papers (2024-11-08T17:15:17Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
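The merging study above comes down to interpolating the weights of models
fine-tuned from a shared initialization. A minimal sketch of linear weight
interpolation (an illustration of the general technique, not UnIVAL's exact
recipe):

```python
def interpolate_state_dicts(sd_a, sd_b, alpha=0.5):
    """Return alpha * A + (1 - alpha) * B over two compatible PyTorch
    state dicts, e.g. two task-specific fine-tunes of one pretrained model."""
    assert sd_a.keys() == sd_b.keys()
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# Hypothetical usage with two fine-tuned checkpoints:
# merged = interpolate_state_dicts(vqa_model.state_dict(),
#                                  caption_model.state_dict())
# base_model.load_state_dict(merged)
```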
- Skill-Based Few-Shot Selection for In-Context Learning [123.26522773708683]
Skill-KNN is a skill-based few-shot selection method for in-context learning.
It does not require training or fine-tuning of any models, making it suitable for frequently expanding or changing example banks.
Experimental results across five cross-domain semantic parsing datasets and six backbone models show that Skill-KNN significantly outperforms existing methods.
arXiv Detail & Related papers (2023-05-23T16:28:29Z)
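As a rough illustration of training-free selection, embed the query and the
example bank (any off-the-shelf sentence encoder would do; that choice is an
assumption here, as are the function names) and take nearest neighbors by
cosine similarity:

```python
import numpy as np

def select_examples(query_emb, bank_embs, bank_examples, k=5):
    """Return the k bank examples most similar to the query under cosine
    similarity. Nothing is trained, so the bank can grow or change freely."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    top = np.argsort(-(b @ q))[:k]
    return [bank_examples[i] for i in top]
```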
- Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning [60.26952378997713]
Contrastive vision-language models (e.g., CLIP) are created by updating all the parameters of a vision model and a language model through contrastive training.
We show that a minimal set of parameter updates (less than 7%) can achieve the same performance as full-model training.
In a series of experiments, we show that existing knowledge is conserved more strongly in parameter-efficient training.
arXiv Detail & Related papers (2023-03-21T14:12:08Z)
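Mechanically, parameter-efficient training of this kind freezes most weights
and leaves a small subset trainable. A hedged sketch in PyTorch; the subset
chosen below (LayerNorms and biases, found by a name-based heuristic) is
illustrative and may differ from the paper's exact choice:

```python
import torch.nn as nn

def freeze_all_but_norms_and_biases(model: nn.Module) -> float:
    """Freeze everything except parameters whose names suggest LayerNorm
    weights or bias vectors; return the trainable fraction."""
    trainable, total = 0, 0
    for name, p in model.named_parameters():
        p.requires_grad = ("bias" in name) or ("norm" in name.lower())
        total += p.numel()
        if p.requires_grad:
            trainable += p.numel()
    return trainable / total
```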
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
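Since the abstract names the trainable pieces explicitly (one linear
projection, one prepended token), a minimal sketch is straightforward; the
class name, shapes and wiring below are assumptions:

```python
import torch
import torch.nn as nn

class PerceptualPrefix(nn.Module):
    """The only trainable parts: a linear projection mapping a frozen
    perceptual encoder's feature into the frozen LM's embedding space,
    plus a single trainable soft token prepended to the text embeddings."""

    def __init__(self, vis_dim, lm_dim):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lm_dim)
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))

    def forward(self, vis_feat, text_embs):
        # vis_feat: (batch, vis_dim); text_embs: (batch, seq, lm_dim)
        prefix = self.proj(vis_feat).unsqueeze(1)
        soft = self.soft_token.expand(text_embs.size(0), -1, -1)
        return torch.cat([soft, prefix, text_embs], dim=1)
```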
- Multitask Learning for Low Resource Spoken Language Understanding [26.106133114838215]
We train models on dual objectives: automatic speech recognition combined with intent classification or sentiment classification.
Our models, although of modest size, show improvements over models trained end-to-end on intent classification.
We study the performance of the models in a low-resource scenario by training them with as few as one example per class.
arXiv Detail & Related papers (2022-11-24T16:38:17Z)
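The dual objective reads as a weighted sum of a speech-recognition loss and a
classification loss over a shared encoder. A sketch assuming CTC for the ASR
side; the weight w and the exact loss pairing are illustrative:

```python
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)        # speech-recognition objective
xent = nn.CrossEntropyLoss()     # intent / sentiment objective

def dual_loss(asr_log_probs, transcripts, input_lens, target_lens,
              intent_logits, intent_labels, w=0.5):
    """Weighted sum of the two objectives; asr_log_probs is shaped
    (time, batch, vocab) as nn.CTCLoss expects."""
    asr = ctc(asr_log_probs, transcripts, input_lens, target_lens)
    cls = xent(intent_logits, intent_labels)
    return w * asr + (1 - w) * cls
```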
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
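Without paired supervision, the reward must come from a pretrained scorer
(e.g. CLIP image-text similarity for captioning). A rough REINFORCE-style
sketch of the policy-gradient step; the mean baseline is an assumption:

```python
def reinforce_loss(seq_log_probs, rewards):
    """Policy-gradient loss over a batch of sampled captions: raise the
    log-probability of samples in proportion to their mean-centered reward.

    seq_log_probs: (batch,) summed log-probs of each sampled caption
    rewards:       (batch,) e.g. CLIP similarity of image and caption
    """
    advantage = rewards - rewards.mean()   # simple baseline
    return -(advantage.detach() * seq_log_probs).mean()
```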
- One Model, Multiple Tasks: Pathways for Natural Language Understanding [34.58880663537492]
This paper presents a Pathways approach to handle many tasks at once.
Unlike prevailing single-purpose models, which overspecialize at individual tasks and learn from scratch when extended to new tasks, our approach is general-purpose, stitching together existing skills to learn new tasks more effectively.
arXiv Detail & Related papers (2022-03-07T11:48:09Z)
- Exploring Versatile Generative Language Model Via Parameter-Efficient Transfer Learning [70.81910984985683]
We propose an effective way to fine-tune a single, large pre-trained model on multiple downstream generation tasks simultaneously.
Experiments on five diverse language generation tasks show that, using just an additional 2-3% of parameters per task, our model can maintain or even improve on the performance of fine-tuning the whole model.
arXiv Detail & Related papers (2020-04-08T06:18:44Z)
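A common way to hit a 2-3% per-task parameter budget is a bottleneck adapter
inserted into each layer of a frozen backbone; the sketch below shows the
standard residual form, which may differ from the paper's exact module:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck MLP: down-project, nonlinearity, up-project.
    With a small bottleneck only a few percent of extra parameters are
    added per task, and only these weights are trained."""

    def __init__(self, d_model, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))
```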
This list is automatically generated from the titles and abstracts of the papers in this site.