M$^3$ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task
Learning with Model-Accelerator Co-design
- URL: http://arxiv.org/abs/2210.14793v1
- Date: Wed, 26 Oct 2022 15:40:24 GMT
- Title: M$^3$ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task
Learning with Model-Accelerator Co-design
- Authors: Hanxue Liang, Zhiwen Fan, Rishov Sarkar, Ziyu Jiang, Tianlong Chen,
Kai Zou, Yu Cheng, Cong Hao, Zhangyang Wang
- Abstract summary: Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly.
Current MTL regimes have to activate nearly the entire model even just to execute a single task.
We present a model-accelerator co-design framework to enable efficient on-device MTL.
- Score: 95.41238363769892
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-task learning (MTL) encapsulates multiple learned tasks in a single
model and often lets those tasks learn better jointly. However, when deploying
MTL onto those real-world systems that are often resource-constrained or
latency-sensitive, two prominent challenges arise: (i) during training,
simultaneously optimizing all tasks is often difficult due to gradient
conflicts across tasks; (ii) at inference, current MTL regimes have to activate
nearly the entire model even just to execute a single task. Yet most real
systems demand only one or two tasks at any given moment, switching between
tasks as needed; such all-task-activated inference is therefore highly
inefficient and does not scale. In this paper, we present a model-accelerator
co-design framework to enable efficient on-device MTL. Our framework, dubbed
M$^3$ViT, customizes mixture-of-experts (MoE) layers into a vision transformer
(ViT) backbone for MTL, and sparsely activates task-specific experts during
training. Then at inference with any task of interest, the same design allows
for activating only the task-corresponding sparse expert pathway, instead of
the full model. Our new model design is further enhanced by hardware-level
innovations, in particular, a novel computation reordering scheme tailored for
memory-constrained MTL that achieves zero-overhead switching between tasks and
can scale to any number of experts. When executing single-task inference,
M$^{3}$ViT achieves higher accuracies than encoder-focused MTL methods, while
reducing inference FLOPs by 88%. When implemented on a hardware
platform of one Xilinx ZCU104 FPGA, our co-design framework reduces the memory
requirement by 2.4 times, while achieving energy efficiency up to 9.23 times
higher than a comparable FPGA baseline. Code is available at:
https://github.com/VITA-Group/M3ViT.
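To make the task-conditioned routing described above concrete, below is a minimal PyTorch sketch of an MoE layer in the spirit of M$^3$ViT: a learned task embedding conditions the gate, so each task activates only its own top-k expert pathway and single-task inference leaves the remaining experts idle. All names and sizes (`TaskMoE`, `d_model`, `n_experts`, `top_k`) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Illustrative sketch only: task-conditioned MoE routing for a ViT MLP block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskMoE(nn.Module):
    """Replaces the dense MLP block of a ViT layer with sparse, task-routed experts."""

    def __init__(self, d_model=384, d_hidden=1536, n_experts=16, n_tasks=2, top_k=4):
        super().__init__()
        self.top_k = top_k
        # Expert FFNs; only top_k of them run for any given token.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        # Task embedding conditions the router, giving each task its own sparse pathway.
        self.task_embed = nn.Embedding(n_tasks, d_model)
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x, task_id):
        # x: (batch, tokens, d_model); task_id selects the active task.
        t = self.task_embed(torch.tensor(task_id, device=x.device))
        logits = self.gate(x + t)                        # task-conditioned routing scores
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep only top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[..., k].unique():               # run each selected expert once
                mask = idx[..., k] == e
                out[mask] += weights[..., k][mask].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out


# Single-task inference touches only the experts routed for that task,
# so most of the model's parameters stay inactive.
layer = TaskMoE()
tokens = torch.randn(2, 196, 384)        # e.g. 14x14 patch tokens
seg_out = layer(tokens, task_id=0)       # e.g. semantic segmentation
depth_out = layer(tokens, task_id=1)     # e.g. depth estimation
```

In this reading of the abstract, all experts receive gradients during training because every task is sampled, while at deployment only the expert weights along the requested task's pathway need to be fetched, which is the property the hardware-level computation reordering is designed to exploit.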
Related papers
- AdapMTL: Adaptive Pruning Framework for Multitask Learning Model [5.643658120200373]
AdapMTL is an adaptive pruning framework for multitask models.
It balances sparsity allocation and accuracy performance across multiple tasks.
It showcases superior performance compared to state-of-the-art pruning methods.
arXiv Detail & Related papers (2024-08-07T17:19:15Z)
- MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning [28.12788291168137]
We present a multi-task fine-tuning framework, MFTcoder, that enables simultaneous and parallel fine-tuning on multiple tasks.
Experiments have conclusively demonstrated that our multi-task fine-tuning approach outperforms both individual fine-tuning on single tasks and fine-tuning on a mixed ensemble of tasks.
arXiv Detail & Related papers (2023-11-04T02:22:40Z)
- Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% reduction in computation, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- AdaMTL: Adaptive Input-dependent Inference for Efficient Multi-Task Learning [1.4963011898406864]
We introduce AdaMTL, an adaptive framework that learns task-aware inference policies for multi-task learning models.
AdaMTL reduces the computational complexity by 43% while improving the accuracy by 1.32% compared to single-task models.
When deployed on Vuzix M4000 smart glasses, AdaMTL reduces the inference latency and the energy consumption by up to 21.8% and 37.5%, respectively.
arXiv Detail & Related papers (2023-04-17T20:17:44Z)
- Controllable Dynamic Multi-Task Architectures [92.74372912009127]
We propose a controllable multi-task network that dynamically adjusts its architecture and weights to match the desired task preference as well as the resource constraints.
We propose a disentangled training of two hypernetworks, by exploiting task affinity and a novel branching regularized loss, to take input preferences and accordingly predict tree-structured models with adapted weights.
arXiv Detail & Related papers (2022-03-28T17:56:40Z)
- Controllable Pareto Multi-Task Learning [55.945680594691076]
A multi-task learning system aims at solving multiple related tasks at the same time.
With a fixed model capacity, the tasks would be conflicted with each other, and the system usually has to make a trade-off among learning all of them together.
This work proposes a novel controllable multi-task learning framework, to enable the system to make real-time trade-off control among different tasks with a single model.
arXiv Detail & Related papers (2020-10-13T11:53:55Z)
- Reparameterizing Convolutions for Incremental Multi-Task Learning without Task Interference [75.95287293847697]
Two common challenges in developing multi-task models are often overlooked in literature.
First, enabling the model to be inherently incremental, continuously incorporating information from new tasks without forgetting the previously learned ones (incremental learning).
Second, eliminating adverse interactions amongst tasks, which have been shown to significantly degrade single-task performance in a multi-task setup (task interference).
arXiv Detail & Related papers (2020-07-24T14:44:46Z)