AutoTaskFormer: Searching Vision Transformers for Multi-task Learning
- URL: http://arxiv.org/abs/2304.08756v2
- Date: Thu, 20 Apr 2023 02:27:04 GMT
- Title: AutoTaskFormer: Searching Vision Transformers for Multi-task Learning
- Authors: Yang Liu, Shen Yan, Yuge Zhang, Kan Ren, Quanlu Zhang, Zebin Ren, Deng Cai, Mi Zhang
- Abstract summary: Vision Transformers have shown great performance in single tasks such as classification and segmentation.
Existing multi-task vision transformers are handcrafted and heavily rely on human expertise.
We propose a novel one-shot neural architecture search framework, dubbed AutoTaskFormer, to automate this process.
- Score: 35.38583552145653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers have shown great performance in single tasks such as
classification and segmentation. However, real-world problems are not isolated,
which calls for vision transformers that can perform multiple tasks
concurrently. Existing multi-task vision transformers are handcrafted and
heavily rely on human expertise. In this work, we propose a novel one-shot
neural architecture search framework, dubbed AutoTaskFormer (Automated
Multi-Task Vision TransFormer), to automate this process. AutoTaskFormer not
only identifies the weights to share across multiple tasks automatically, but
also provides thousands of well-trained vision transformers with a wide range
of parameters (e.g., number of heads and network depth) for deployment under
various resource constraints. Experiments on both small-scale (2-task
Cityscapes and 3-task NYUv2) and large-scale (16-task Taskonomy) datasets show
that AutoTaskFormer outperforms state-of-the-art handcrafted vision
transformers in multi-task learning. The entire code and models will be
open-sourced.
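To make the idea above concrete, here is a minimal, hedged sketch (not the authors' released code) of a one-shot multi-task vision-transformer supernet in PyTorch: a pool of shared encoder blocks is trained once, subnets of varying depth are sampled during training, and each task routes through its own subset of blocks plus a task-specific head. The names (MTSupernet, sample_subnet) and the restriction to elastic depth only are illustrative assumptions; the paper also searches over other dimensions such as the number of attention heads.

```python
import random
import torch
import torch.nn as nn

class MTSupernet(nn.Module):
    """Weight-sharing supernet: tasks reuse a common pool of encoder blocks (illustrative)."""
    def __init__(self, embed_dim=192, num_heads=3, max_depth=12, tasks=("semseg", "depth")):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
             for _ in range(max_depth)])
        # One lightweight head per task on top of the shared trunk.
        self.heads = nn.ModuleDict({t: nn.Linear(embed_dim, embed_dim) for t in tasks})

    def sample_subnet(self):
        """Randomly pick a depth, then a per-task subset of blocks to route through."""
        depth = random.randint(1, len(self.blocks))
        return {t: sorted(random.sample(range(depth), k=max(1, depth // 2)))
                for t in self.heads}

    def forward(self, x, subnet):
        out = {}
        for task, block_ids in subnet.items():
            h = x
            for i in block_ids:                 # blocks reused by several tasks share weights
                h = self.blocks[i](h)
            out[task] = self.heads[task](h)
        return out

# One supernet training step samples a subnet and runs every task through it.
model = MTSupernet()
tokens = torch.randn(2, 196, 192)               # (batch, num_patches, embed_dim)
preds = model(tokens, model.sample_subnet())
```

After supernet training, individual subnets can be evaluated and the best one per resource budget deployed without retraining, which is the usual one-shot NAS workflow and matches the abstract's claim of many deployable sub-models.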
Related papers
- Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving [85.62076860189116]
Video Task Decathlon (VTD) includes ten representative image and video tasks spanning classification, segmentation, localization, and association of objects and pixels.
We develop our unified network, VTDNet, that uses a single structure and a single set of weights for all ten tasks.
arXiv Detail & Related papers (2023-09-08T16:33:27Z)
- Vision Transformer Adapters for Generalizable Multitask Learning [61.79647180647685]
We introduce the first multitasking vision transformer adapters that learn generalizable task affinities.
Our adapters can simultaneously solve multiple dense vision tasks in a parameter-efficient manner.
In contrast to concurrent methods, we do not require retraining or fine-tuning whenever a new task or domain is added.
arXiv Detail & Related papers (2023-08-23T18:40:48Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding [11.608682595506354]
Multi-task scene understanding aims to design a single versatile model that can simultaneously predict several scene understanding tasks.
Previous studies typically process multi-task features in a more local way, and thus cannot effectively learn spatially global and cross-task interactions.
We propose an Inverted Pyramid multi-task Transformer, capable of modeling cross-task interaction among spatial features of different tasks in a global context.
arXiv Detail & Related papers (2023-06-08T00:28:22Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks [36.34331439747556]
We propose Polyhistor and Polyhistor-Lite to share information across different tasks with a few trainable parameters.
Specifically, Polyhistor achieves accuracy competitive with the state-of-the-art while using only 10% of their trainable parameters.
arXiv Detail & Related papers (2022-10-07T00:25:02Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads (see the first sketch after this list).
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks [37.2958914602899]
We show that we can learn adapter parameters for all layers and tasks by generating them using shared hypernetworks.
Experiments on the well-known GLUE benchmark show improved performance in multi-task learning while adding only 0.29% parameters per task (a minimal sketch of the hypernetwork idea follows the list).
arXiv Detail & Related papers (2021-06-08T16:16:40Z)
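For the MulT entry above, here is a minimal sketch (an assumption, not the released MulT code) of the shared-encoder, task-specific-decoder design it describes: one transformer encoder produces a shared representation, and each task decodes it with its own transformer decoder head. The class name SharedEncoderMultiDecoder and the learned per-task queries are illustrative choices.

```python
import torch
import torch.nn as nn

class SharedEncoderMultiDecoder(nn.Module):
    """Shared encoder with one transformer decoder head per task (illustrative)."""
    def __init__(self, d_model=256, tasks=("semseg", "depth"), num_queries=100):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.decoders = nn.ModuleDict({
            t: nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
            for t in tasks})
        # Learned queries act as task-specific "prompts" for each decoder head.
        self.queries = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(num_queries, d_model)) for t in tasks})

    def forward(self, tokens):
        memory = self.encoder(tokens)  # shared representation for all tasks
        return {t: self.decoders[t](self.queries[t].expand(tokens.size(0), -1, -1), memory)
                for t in self.decoders}

model = SharedEncoderMultiDecoder()
patch_tokens = torch.randn(2, 196, 256)   # (batch, num_patches, d_model)
outputs = model(patch_tokens)             # one decoded feature map per task
```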
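For the shared-hypernetworks entry, here is a minimal sketch of generating per-task adapter weights from a learned task embedding with a single shared hypernetwork. This is again an assumption rather than the paper's code: the class name HyperAdapters, the bottleneck size, and the omission of the paper's layer embeddings are illustrative simplifications.

```python
import torch
import torch.nn as nn

class HyperAdapters(nn.Module):
    """One shared hypernetwork emits bottleneck-adapter weights per task (illustrative)."""
    def __init__(self, d_model=768, bottleneck=64, tasks=("mnli", "qqp", "sst2")):
        super().__init__()
        self.d, self.b = d_model, bottleneck
        self.task_ids = {t: i for i, t in enumerate(tasks)}
        self.task_emb = nn.Embedding(len(tasks), 64)      # learned task embeddings
        # Shared across all tasks: maps a task embedding to the flattened
        # down- and up-projection matrices of a residual bottleneck adapter.
        self.hyper = nn.Linear(64, 2 * d_model * bottleneck)

    def forward(self, hidden, task):
        z = self.task_emb(torch.tensor([self.task_ids[task]]))
        w = self.hyper(z).view(2, self.d, self.b)
        down, up = w[0], w[1].transpose(0, 1)             # (d, b) and (b, d)
        return hidden + torch.relu(hidden @ down) @ up    # residual adapter

adapters = HyperAdapters()
hidden_states = torch.randn(4, 128, 768)                  # (batch, seq, d_model)
adapted = adapters(hidden_states, task="qqp")
```

In this style of parameter-efficient fine-tuning, only the shared hypernetwork and the task embeddings are updated while the backbone stays frozen, which is what keeps the per-task parameter overhead small.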
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.