An Efficient General-Purpose Modular Vision Model via Multi-Task
Heterogeneous Training
- URL: http://arxiv.org/abs/2306.17165v1
- Date: Thu, 29 Jun 2023 17:59:57 GMT
- Title: An Efficient General-Purpose Modular Vision Model via Multi-Task
Heterogeneous Training
- Authors: Zitian Chen, Mingyu Ding, Yikang Shen, Wei Zhan, Masayoshi Tomizuka,
Erik Learned-Miller, Chuang Gan
- Abstract summary: We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
- Score: 79.78201886156513
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We present a model that can perform multiple vision tasks and can be adapted
to other downstream tasks efficiently. Despite considerable progress in
multi-task learning, most efforts focus on learning from multi-label data: a
single image set with multiple task labels. Such multi-label data sets are
rare, small, and expensive. We say heterogeneous to refer to image sets with
different task labels, or to combinations of single-task datasets. Few have
explored training on such heterogeneous datasets. General-purpose vision models
are still dominated by single-task pretraining, and it remains unclear how to
scale up multi-task models by leveraging mainstream vision datasets designed
for different purposes. The challenges lie in managing large intrinsic
differences among vision tasks, including data distribution, architectures,
task-specific modules, dataset scales, and sampling strategies. To address
these challenges, we propose to modify and scale up mixture-of-experts (MoE)
vision transformers, so that they can simultaneously learn classification,
detection, and segmentation on diverse mainstream vision datasets including
ImageNet, COCO, and ADE20K. Our approach achieves comparable results to
single-task state-of-the-art models and demonstrates strong generalization on
downstream tasks. Due to its emergent modularity, this general-purpose model
decomposes into high-performing components, efficiently adapting to downstream
tasks. We can fine-tune it with fewer training parameters, fewer model
parameters, and less computation. Additionally, its modularity allows for easy
expansion in continual-learning-without-forgetting scenarios. Finally, these
functions can be controlled and combined to meet various demands of downstream
tasks.
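The abstract outlines the core idea (scaling up mixture-of-experts vision transformers so that one backbone learns classification, detection, and segmentation on heterogeneous datasets) but, as an abstract, gives no implementation details. The PyTorch sketch below is a minimal illustration of that general pattern only, not the paper's code: a token-level, top-1-routed MoE feed-forward layer shared by lightweight task-specific heads. All module names, sizes, and the routing rule are assumptions for illustration, and the sketch omits detection, expert load balancing, and the heterogeneous sampling strategies the paper addresses.

```python
# Illustrative sketch only: a simplified MoE feed-forward layer of the kind
# that could replace the MLP in a vision transformer block, plus task-specific
# heads sharing the same backbone. Names, sizes, and the top-1 routing rule
# are assumptions for illustration, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Token-level, top-1-routed mixture of expert MLPs."""

    def __init__(self, dim: int, num_experts: int = 4, hidden_mult: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(dim, dim * hidden_mult),
                nn.GELU(),
                nn.Linear(dim * hidden_mult, dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); send each token to its highest-scoring expert
        gate = F.softmax(self.router(x), dim=-1)   # (B, T, num_experts)
        top_gate, top_idx = gate.max(dim=-1)       # (B, T)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                    # tokens routed to expert e
            if mask.any():
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        return out


class MultiTaskHeads(nn.Module):
    """Lightweight task-specific heads on top of a shared token representation."""

    def __init__(self, dim: int, num_classes: int, num_seg_classes: int):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes)      # image classification
        self.seg_head = nn.Linear(dim, num_seg_classes)  # per-patch segmentation logits

    def forward(self, tokens: torch.Tensor, task: str) -> torch.Tensor:
        if task == "classification":
            return self.cls_head(tokens.mean(dim=1))     # pool tokens, then classify
        if task == "segmentation":
            return self.seg_head(tokens)                 # per-token (patch) prediction
        raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    x = torch.randn(2, 196, 256)                         # 2 images, 14x14 patches, dim 256
    backbone_layer = MoEFeedForward(dim=256, num_experts=4)
    heads = MultiTaskHeads(dim=256, num_classes=1000, num_seg_classes=150)
    shared = backbone_layer(x)                           # shared MoE computation
    print(heads(shared, "classification").shape)         # torch.Size([2, 1000])
    print(heads(shared, "segmentation").shape)           # torch.Size([2, 196, 150])
```

Because tokens from any task pass through the same routed experts, individual experts and heads can in principle be extracted, frozen, or fine-tuned separately; this is the kind of modularity the abstract refers to when it mentions adapting to downstream tasks with fewer training and model parameters.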
Related papers
- A Multitask Deep Learning Model for Classification and Regression of Hyperspectral Images: Application to the large-scale dataset [44.94304541427113]
We propose a multitask deep learning model to perform multiple classification and regression tasks simultaneously on hyperspectral images.
We validated our approach on a large hyperspectral dataset called TAIGA.
A comprehensive qualitative and quantitative analysis of the results shows that the proposed method significantly outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-23T11:14:54Z)
- Investigating Self-Supervised Methods for Label-Efficient Learning [27.029542823306866]
We study different self-supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling, for their low-shot capabilities.
We introduce a framework involving both masked image modelling and clustering as pretext tasks, which performs better across all low-shot downstream tasks.
When testing the model on full scale datasets, we show performance gains in multi-class classification, multi-label classification and semantic segmentation.
arXiv Detail & Related papers (2024-06-25T10:56:03Z)
- Merging Vision Transformers from Different Tasks and Domains [46.40701388197936]
This work aims to merge various Vision Transformers (ViTs) trained on different tasks (i.e., datasets with different object categories) or domains (i.e., datasets with the same categories but different environments) into one unified model.
Previous model-merging work focuses on either CNNs or NLP models, leaving the merging of ViTs unexplored.
arXiv Detail & Related papers (2023-12-25T09:32:28Z) - FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion
Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
FAME-ViL saves 61.5% of parameters relative to alternatives while significantly outperforming conventional, independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z) - Zero Experience Required: Plug & Play Modular Transfer Learning for
Semantic Visual Navigation [97.17517060585875]
We present a unified approach to visual navigation using a novel modular transfer learning model.
Our model can effectively leverage its experience from one source task and apply it to multiple target tasks.
Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
arXiv Detail & Related papers (2022-02-05T00:07:21Z) - The Effect of Diversity in Meta-Learning [79.56118674435844]
Few-shot learning aims to learn representations that can tackle novel tasks given a small number of examples.
Recent studies show that task distribution plays a vital role in the model's performance.
We study different task distributions on a myriad of models and datasets to evaluate the effect of task diversity on meta-learning algorithms.
arXiv Detail & Related papers (2022-01-27T19:39:07Z) - Uni-Perceiver: Pre-training Unified Architecture for Generic Perception
for Zero-shot and Few-shot Tasks [73.63892022944198]
We present a generic perception architecture named Uni-Perceiver.
It processes a variety of modalities and tasks with unified modeling and shared parameters.
Results show that our pre-trained model without any tuning can achieve reasonable performance even on novel tasks.
arXiv Detail & Related papers (2021-12-02T18:59:50Z) - PolyViT: Co-training Vision Transformers on Images, Videos and Audio [80.0913507142036]
We present PolyViT, a model trained on images, audio, and video.
By co-training different tasks on a single modality, we are able to improve the accuracy of each individual task.
We show that co-training is simple and practical to implement.
arXiv Detail & Related papers (2021-11-25T10:01:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.