AutoTaskFormer: Searching Vision Transformers for Multi-task Learning
- URL: http://arxiv.org/abs/2304.08756v2
- Date: Thu, 20 Apr 2023 02:27:04 GMT
- Title: AutoTaskFormer: Searching Vision Transformers for Multi-task Learning
- Authors: Yang Liu, Shen Yan, Yuge Zhang, Kan Ren, Quanlu Zhang, Zebin Ren, Deng Cai, Mi Zhang
- Abstract summary: Vision Transformers have shown great performance in single tasks such as classification and segmentation.
Existing multi-task vision transformers are handcrafted and heavily rely on human expertise.
We propose a novel one-shot neural architecture search framework, dubbed AutoTaskFormer, to automate this process.
- Score: 35.38583552145653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers have shown great performance in single tasks such as
classification and segmentation. However, real-world problems are not isolated,
which calls for vision transformers that can perform multiple tasks
concurrently. Existing multi-task vision transformers are handcrafted and
heavily rely on human expertise. In this work, we propose a novel one-shot
neural architecture search framework, dubbed AutoTaskFormer (Automated
Multi-Task Vision TransFormer), to automate this process. AutoTaskFormer not
only identifies the weights to share across multiple tasks automatically, but
also provides thousands of well-trained vision transformers with a wide range
of parameters (e.g., number of heads and network depth) for deployment under
various resource constraints. Experiments on both small-scale (2-task
Cityscapes and 3-task NYUv2) and large-scale (16-task Taskonomy) datasets show
that AutoTaskFormer outperforms state-of-the-art handcrafted vision
transformers in multi-task learning. The entire code and models will be
open-sourced.
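To make the idea above concrete, here is a minimal, hedged sketch (not the authors' released code) of a one-shot multi-task vision-transformer supernet in PyTorch: a pool of shared encoder blocks is trained once, subnets of varying depth are sampled during training, and each task routes through its own subset of blocks plus a task-specific head. The names (MTSupernet, sample_subnet) and the restriction to elastic depth only are illustrative assumptions; the paper also searches over other dimensions such as the number of attention heads.

```python
import random
import torch
import torch.nn as nn

class MTSupernet(nn.Module):
    """Weight-sharing supernet: tasks reuse a common pool of encoder blocks (illustrative)."""
    def __init__(self, embed_dim=192, num_heads=3, max_depth=12, tasks=("semseg", "depth")):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
             for _ in range(max_depth)])
        # One lightweight head per task on top of the shared trunk.
        self.heads = nn.ModuleDict({t: nn.Linear(embed_dim, embed_dim) for t in tasks})

    def sample_subnet(self):
        """Randomly pick a depth, then a per-task subset of blocks to route through."""
        depth = random.randint(1, len(self.blocks))
        return {t: sorted(random.sample(range(depth), k=max(1, depth // 2)))
                for t in self.heads}

    def forward(self, x, subnet):
        out = {}
        for task, block_ids in subnet.items():
            h = x
            for i in block_ids:                 # blocks reused by several tasks share weights
                h = self.blocks[i](h)
            out[task] = self.heads[task](h)
        return out

# One supernet training step samples a subnet and runs every task through it.
model = MTSupernet()
tokens = torch.randn(2, 196, 192)               # (batch, num_patches, embed_dim)
preds = model(tokens, model.sample_subnet())
```

After supernet training, individual subnets can be evaluated and the best one per resource budget deployed without retraining, which is the usual one-shot NAS workflow and matches the abstract's claim of many deployable sub-models.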
Related papers
- Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving [85.62076860189116]
Video Task Decathlon (VTD) includes ten representative image and video tasks spanning classification, segmentation, localization, and association of objects and pixels.
We develop our unified network, VTDNet, that uses a single structure and a single set of weights for all ten tasks.
arXiv Detail & Related papers (2023-09-08T16:33:27Z)
- Vision Transformer Adapters for Generalizable Multitask Learning [61.79647180647685]
We introduce the first multitasking vision transformer adapters that learn generalizable task affinities.
Our adapters can simultaneously solve multiple dense vision tasks in a parameter-efficient manner.
In contrast to concurrent methods, we do not require retraining or fine-tuning whenever a new task or domain is added.
arXiv Detail & Related papers (2023-08-23T18:40:48Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding [11.608682595506354]
Multi-task scene understanding aims to design a single versatile model that can simultaneously predict several scene understanding tasks.
Previous studies typically process multi-task features in a more local way, and thus cannot effectively learn spatially global and cross-task interactions.
We propose an Inverted Pyramid multi-task Transformer, capable of modeling cross-task interaction among spatial features of different tasks in a global context.
arXiv Detail & Related papers (2023-06-08T00:28:22Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks [36.34331439747556]
We propose Polyhistor and Polyhistor-Lite to share information across different tasks with a few trainable parameters.
Specifically, Polyhistor achieves accuracy competitive with the state-of-the-art while using only 10% of their trainable parameters.
arXiv Detail & Related papers (2022-10-07T00:25:02Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads (see the first sketch after this list).
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks [37.2958914602899]
We show that we can learn adapter parameters for all layers and tasks by generating them using shared hypernetworks.
Experiments on the well-known GLUE benchmark show improved performance in multi-task learning while adding only 0.29% parameters per task (a minimal sketch of the hypernetwork idea follows the list).
arXiv Detail & Related papers (2021-06-08T16:16:40Z)
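For the MulT entry above, here is a minimal sketch (an assumption, not the released MulT code) of the shared-encoder, task-specific-decoder design it describes: one transformer encoder produces a shared representation, and each task decodes it with its own transformer decoder head. The class name SharedEncoderMultiDecoder and the learned per-task queries are illustrative choices.

```python
import torch
import torch.nn as nn

class SharedEncoderMultiDecoder(nn.Module):
    """Shared encoder with one transformer decoder head per task (illustrative)."""
    def __init__(self, d_model=256, tasks=("semseg", "depth"), num_queries=100):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.decoders = nn.ModuleDict({
            t: nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
            for t in tasks})
        # Learned queries act as task-specific "prompts" for each decoder head.
        self.queries = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(num_queries, d_model)) for t in tasks})

    def forward(self, tokens):
        memory = self.encoder(tokens)  # shared representation for all tasks
        return {t: self.decoders[t](self.queries[t].expand(tokens.size(0), -1, -1), memory)
                for t in self.decoders}

model = SharedEncoderMultiDecoder()
patch_tokens = torch.randn(2, 196, 256)   # (batch, num_patches, d_model)
outputs = model(patch_tokens)             # one decoded feature map per task
```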
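For the shared-hypernetworks entry, here is a minimal sketch of generating per-task adapter weights from a learned task embedding with a single shared hypernetwork. This is again an assumption rather than the paper's code: the class name HyperAdapters, the bottleneck size, and the omission of the paper's layer embeddings are illustrative simplifications.

```python
import torch
import torch.nn as nn

class HyperAdapters(nn.Module):
    """One shared hypernetwork emits bottleneck-adapter weights per task (illustrative)."""
    def __init__(self, d_model=768, bottleneck=64, tasks=("mnli", "qqp", "sst2")):
        super().__init__()
        self.d, self.b = d_model, bottleneck
        self.task_ids = {t: i for i, t in enumerate(tasks)}
        self.task_emb = nn.Embedding(len(tasks), 64)      # learned task embeddings
        # Shared across all tasks: maps a task embedding to the flattened
        # down- and up-projection matrices of a residual bottleneck adapter.
        self.hyper = nn.Linear(64, 2 * d_model * bottleneck)

    def forward(self, hidden, task):
        z = self.task_emb(torch.tensor([self.task_ids[task]]))
        w = self.hyper(z).view(2, self.d, self.b)
        down, up = w[0], w[1].transpose(0, 1)             # (d, b) and (b, d)
        return hidden + torch.relu(hidden @ down) @ up    # residual adapter

adapters = HyperAdapters()
hidden_states = torch.randn(4, 128, 768)                  # (batch, seq, d_model)
adapted = adapters(hidden_states, task="qqp")
```

In this style of parameter-efficient fine-tuning, only the shared hypernetwork and the task embeddings are updated while the backbone stays frozen, which is what keeps the per-task parameter overhead small.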
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.