Sparsely Activated Mixture-of-Experts are Robust Multi-Task Learners
- URL: http://arxiv.org/abs/2204.07689v1
- Date: Sat, 16 Apr 2022 00:56:12 GMT
- Title: Sparsely Activated Mixture-of-Experts are Robust Multi-Task Learners
- Authors: Shashank Gupta, Subhabrata Mukherjee, Krishan Subudhi, Eduardo
Gonzalez, Damien Jose, Ahmed H. Awadallah, Jianfeng Gao
- Abstract summary: We study whether sparsely activated Mixture-of-Experts (MoE) improve multi-task learning.
We devise task-aware gating functions to route examples from different tasks to specialized experts.
This results in a sparsely activated multi-task model with a large number of parameters, but with the same computational cost as that of a dense model.
- Score: 67.5865966762559
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional multi-task learning (MTL) methods rely on dense networks that apply the
same set of shared weights across several different tasks. This often creates
interference where two or more tasks compete to pull model parameters in
different directions. In this work, we study whether sparsely activated
Mixture-of-Experts (MoE) improve multi-task learning by specializing some
weights for learning shared representations and using the others for learning
task-specific information. To this end, we devise task-aware gating functions
to route examples from different tasks to specialized experts which share
subsets of network weights conditioned on the task. This results in a sparsely
activated multi-task model with a large number of parameters, but with the same
computational cost as that of a dense model. We demonstrate that such sparse
networks improve multi-task learning along three key dimensions: (i)
transfer to low-resource tasks from related tasks in the training mixture; (ii)
sample-efficient generalization to tasks not seen during training by making use
of task-aware routing from seen related tasks; (iii) robustness to the addition
of unrelated tasks by avoiding catastrophic forgetting of existing tasks.
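Below is a minimal sketch of the task-aware gating idea described in the abstract, assuming a PyTorch-style implementation with hard top-1 routing; the class name, layer shapes, and use of a learned task embedding are illustrative assumptions, not the paper's released code. Each example is dispatched to one expert chosen from its task identity, so total parameters grow with the number of experts while per-example compute matches a single dense feed-forward block.

import torch
import torch.nn as nn

class TaskAwareMoELayer(nn.Module):
    """Illustrative task-aware mixture-of-experts feed-forward layer (sketch)."""

    def __init__(self, d_model, d_hidden, num_experts, num_tasks):
        super().__init__()
        # Pool of expert feed-forward blocks; only one is run per example.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # Task-aware gate: scores experts from a learned embedding of the task id.
        self.task_embedding = nn.Embedding(num_tasks, d_model)
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x, task_ids):
        # x: (batch, d_model); task_ids: (batch,) integer task labels.
        gate_logits = self.gate(self.task_embedding(task_ids))  # (batch, num_experts)
        expert_idx = gate_logits.argmax(dim=-1)                  # top-1, task-conditioned routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():                                       # run each expert only on its own examples
                out[mask] = expert(x[mask])
        return out

# Hypothetical usage: 8 experts shared across 4 tasks, batch of 16 examples.
layer = TaskAwareMoELayer(d_model=256, d_hidden=1024, num_experts=8, num_tasks=4)
x = torch.randn(16, 256)
task_ids = torch.randint(0, 4, (16,))
y = layer(x, task_ids)  # (16, 256)

Because only one expert runs per example, adding experts or tasks increases model capacity without increasing forward-pass FLOPs, which is the constant-compute property the abstract highlights.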
Related papers
- DiSparse: Disentangled Sparsification for Multitask Model Compression [92.84435347164435]
DiSparse is a simple, effective, and first-of-its-kind multitask pruning and sparse training scheme.
Our experimental results demonstrate superior performance on various configurations and settings.
arXiv Detail & Related papers (2022-06-09T17:57:46Z) - Modular Adaptive Policy Selection for Multi-Task Imitation Learning
through Task Division [60.232542918414985]
Multi-task learning often suffers from negative transfer, sharing information that should be task-specific.
The proposed method mitigates this by using proto-policies as modules to divide the tasks into simple sub-behaviours that can be shared.
We also demonstrate its ability to autonomously divide the tasks into both shared and task-specific sub-behaviours.
arXiv Detail & Related papers (2022-03-28T15:53:17Z) - Multi-Task Learning with Sequence-Conditioned Transporter Networks [67.57293592529517]
We aim to solve multi-task learning through the lens of sequence-conditioning and weighted sampling.
We propose a new benchmark suite aimed at compositional tasks, MultiRavens, which allows defining custom task combinations.
Second, we propose a vision-based end-to-end system architecture, Sequence-Conditioned Transporter Networks, which augments Goal-Conditioned Transporter Networks with sequence-conditioning and weighted sampling.
arXiv Detail & Related papers (2021-09-15T21:19:11Z) - MultiTask-CenterNet (MCN): Efficient and Diverse Multitask Learning
using an Anchor Free Approach [0.13764085113103217]
Multitask learning is a common approach in machine learning.
In this paper, we augment the CenterNet anchor-free approach to train multiple perception-related tasks together.
arXiv Detail & Related papers (2021-08-11T06:57:04Z) - Multi-Task Learning with Deep Neural Networks: A Survey [0.0]
Multi-task learning (MTL) is a subfield of machine learning in which multiple tasks are simultaneously learned by a shared model.
We give an overview of multi-task learning methods for deep neural networks, with the aim of summarizing both the well-established and most recent directions within the field.
arXiv Detail & Related papers (2020-09-10T19:31:04Z) - Reparameterizing Convolutions for Incremental Multi-Task Learning
without Task Interference [75.95287293847697]
Two common challenges in developing multi-task models are often overlooked in the literature.
First, enabling the model to be inherently incremental, continuously incorporating information from new tasks without forgetting the previously learned ones (incremental learning).
Second, eliminating adverse interactions amongst tasks, which have been shown to significantly degrade single-task performance in a multi-task setup (task interference).
arXiv Detail & Related papers (2020-07-24T14:44:46Z) - Navigating the Trade-Off between Multi-Task Learning and Learning to
Multitask in Deep Neural Networks [9.278739724750343]
Multi-task learning refers to a paradigm in machine learning in which a network is trained on various related tasks to facilitate the acquisition of each of them.
Multitasking, especially in the cognitive science literature, refers to the ability to execute multiple tasks simultaneously.
We show that the same tension arises in deep networks and discuss a meta-learning algorithm for an agent to manage this trade-off in an unfamiliar environment.
arXiv Detail & Related papers (2020-07-20T23:26:16Z) - Knowledge Distillation for Multi-task Learning [38.20005345733544]
Multi-task learning (MTL) aims to learn a single model that performs multiple tasks, achieving good performance on all of them at a lower computational cost.
Learning such a model requires jointly optimizing the losses of a set of tasks with different difficulty levels, magnitudes, and characteristics.
We propose a knowledge distillation based method in this work to address the imbalance problem in multi-task learning.
arXiv Detail & Related papers (2020-07-14T08:02:42Z) - MTI-Net: Multi-Scale Task Interaction Networks for Multi-Task Learning [82.62433731378455]
We show that tasks with high affinity at a certain scale are not guaranteed to retain this behaviour at other scales.
We propose a novel architecture, namely MTI-Net, that builds upon this finding.
arXiv Detail & Related papers (2020-01-19T21:02:36Z)