Adaptive Gating in Mixture-of-Experts based Language Models
- URL: http://arxiv.org/abs/2310.07188v1
- Date: Wed, 11 Oct 2023 04:30:18 GMT
- Title: Adaptive Gating in Mixture-of-Experts based Language Models
- Authors: Jiamin Li, Qiang Su, Yitao Yang, Yimin Jiang, Cong Wang, Hong Xu
- Abstract summary: Sparsely activated mixture-of-experts (MoE) has emerged as a promising solution for scaling models.
This paper introduces adaptive gating in MoE, a flexible training strategy that allows tokens to be processed by a variable number of experts.
- Score: 7.936874532105228
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models, such as OpenAI's ChatGPT, have demonstrated
exceptional language understanding capabilities in various NLP tasks. Sparsely
activated mixture-of-experts (MoE) has emerged as a promising solution for
scaling models while maintaining a constant number of computational operations.
Existing MoE model adopts a fixed gating network where each token is computed
by the same number of experts. However, this approach contradicts our intuition
that the tokens in each sequence vary in terms of their linguistic complexity
and, consequently, require different computational costs. Little is discussed
in prior research on the trade-off between computation per token and model
performance. This paper introduces adaptive gating in MoE, a flexible training
strategy that allows tokens to be processed by a variable number of experts
based on expert probability distribution. The proposed framework preserves
sparsity while improving training efficiency. Additionally, curriculum learning
is leveraged to further reduce training time. Extensive experiments on diverse
NLP tasks show that adaptive gating reduces at most 22.5% training time while
maintaining inference quality. Moreover, we conduct a comprehensive analysis of
the routing decisions and present our insights when adaptive gating is used.
Related papers
- Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models [4.109351791494196]
We introduce the Dynamic Mixture of Experts (DynMoE) technique to enhance the efficiency of training and inference for Transformer-based foundational models.
DynMoE incorporates a novel gating method that enables each token to automatically determine the number of experts to activate.
Our results demonstrate the effectiveness of our approach to achieve competitive performance compared to GMoE for vision and language tasks, and MoE-LLaVA for vision-language tasks.
arXiv Detail & Related papers (2024-05-23T08:18:30Z) - FedLALR: Client-Specific Adaptive Learning Rates Achieve Linear Speedup
for Non-IID Data [54.81695390763957]
Federated learning is an emerging distributed machine learning method.
We propose a heterogeneous local variant of AMSGrad, named FedLALR, in which each client adjusts its learning rate.
We show that our client-specified auto-tuned learning rate scheduling can converge and achieve linear speedup with respect to the number of clients.
arXiv Detail & Related papers (2023-09-18T12:35:05Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z) - Adversarial Training with Contrastive Learning in NLP [0.0]
We propose adversarial training with contrastive learning (ATCL) to adversarially train a language processing task.
The core idea is to make linear perturbations in the embedding space of the input via fast gradient methods (FGM) and train the model to keep the original and perturbed representations close via contrastive learning.
The results show not only an improvement in the quantitative (perplexity and BLEU) scores when compared to the baselines, but ATCL also achieves good qualitative results in the semantic level for both tasks.
arXiv Detail & Related papers (2021-09-19T07:23:45Z) - Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z) - Active Learning for Sequence Tagging with Deep Pre-trained Models and
Bayesian Uncertainty Estimates [52.164757178369804]
Recent advances in transfer learning for natural language processing in conjunction with active learning open the possibility to significantly reduce the necessary annotation budget.
We conduct an empirical study of various Bayesian uncertainty estimation methods and Monte Carlo dropout options for deep pre-trained models in the active learning framework.
We also demonstrate that to acquire instances during active learning, a full-size Transformer can be substituted with a distilled version, which yields better computational performance.
arXiv Detail & Related papers (2021-01-20T13:59:25Z) - Reducing Confusion in Active Learning for Part-Of-Speech Tagging [100.08742107682264]
Active learning (AL) uses a data selection algorithm to select useful training samples to minimize annotation cost.
We study the problem of selecting instances which maximally reduce the confusion between particular pairs of output tags.
Our proposed AL strategy outperforms other AL strategies by a significant margin.
arXiv Detail & Related papers (2020-11-02T06:24:58Z) - Parameter Space Factorization for Zero-Shot Learning across Tasks and
Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.