Designing Effective Sparse Expert Models
- URL: http://arxiv.org/abs/2202.08906v1
- Date: Thu, 17 Feb 2022 21:39:10 GMT
- Title: Designing Effective Sparse Expert Models
- Authors: Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff
Dean, Noam Shazeer, William Fedus
- Abstract summary: Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to larger and more capable language models.
But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning.
We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer.
- Score: 45.21279650229869
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scale has opened new frontiers in natural language processing -- but at a
high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have
been proposed as an energy efficient path to even larger and more capable
language models. But advancing the state-of-the-art across a broad set of
natural language tasks has been hindered by training instabilities and
uncertain quality during fine-tuning. Our work focuses on these issues and acts
as a design guide. We conclude by scaling a sparse model to 269B parameters,
with a computational cost comparable to a 32B dense encoder-decoder Transformer
(Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time,
a sparse model achieves state-of-the-art performance in transfer learning,
across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC
Challenge), summarization (XSum, CNN-DM), closed book question answering
(WebQA, Natural Questions), and adversarially constructed tasks (Winogrande,
ANLI R3).
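To make the sparse-expert idea concrete, here is a minimal NumPy sketch of top-1 ("Switch"-style) expert routing with a load-balancing auxiliary loss and a z-loss-style penalty on the router logits (the kind of stabilizing term this paper discusses). All names, shapes, and coefficients below are illustrative assumptions, not the ST-MoE implementation.
```python
# Minimal sketch of top-1 ("Switch"-style) expert routing.
# Names, shapes, and loss coefficients are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top1_route(tokens, router_weights, experts, aux_coef=1e-2, z_coef=1e-3):
    """tokens: [n, d_model]; router_weights: [d_model, num_experts];
    experts: list of callables mapping [m, d_model] -> [m, d_model]."""
    num_experts = len(experts)
    logits = tokens @ router_weights                       # [n, num_experts]
    probs = softmax(logits, axis=-1)
    expert_index = probs.argmax(axis=-1)                   # top-1 expert per token
    gate = probs[np.arange(len(tokens)), expert_index]     # routing probability

    # Load-balancing loss: pushes the token -> expert assignment toward uniform.
    frac_tokens = np.bincount(expert_index, minlength=num_experts) / len(tokens)
    frac_probs = probs.mean(axis=0)
    aux_loss = aux_coef * num_experts * np.sum(frac_tokens * frac_probs)

    # z-loss-style penalty: discourages very large router logits (stability).
    z_loss = z_coef * np.mean(np.log(np.exp(logits).sum(axis=-1)) ** 2)

    # Each token is processed only by its selected expert, scaled by its gate.
    out = np.zeros_like(tokens)
    for e in range(num_experts):
        mask = expert_index == e
        if mask.any():
            out[mask] = gate[mask, None] * experts[e](tokens[mask])
    return out, aux_loss + z_loss

# Tiny usage example with random expert FFNs.
rng = np.random.default_rng(0)
d_model, n_experts = 8, 4
tokens = rng.normal(size=(16, d_model))
router_w = rng.normal(size=(d_model, n_experts))
experts = [lambda x, W=rng.normal(size=(d_model, d_model)) * 0.1: np.tanh(x @ W)
           for _ in range(n_experts)]
out, aux = top1_route(tokens, router_w, experts)
print(out.shape, float(aux))
```
Because each token touches only one expert feed-forward block, parameter count grows with the number of experts while per-token compute stays close to that of a much smaller dense model, which is the trade-off behind the 269B-parameter / 32B-compute comparison above.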
Related papers
- U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF [10.81723269312202]
Mixture-of-Experts (MoE) models have been proposed as an energy efficient path to larger and more capable language models.
We benchmark our proposed model on a large-scale inner-source dataset (160k hours).
arXiv Detail & Related papers (2024-04-25T08:34:21Z)
- Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks [5.536630285985836]
We introduce parameter-efficient sparsity crafting (PESC), which crafts dense models into sparse models using the mixture-of-experts (MoE) architecture.
Our best sparse model outperforms other sparse and dense models and exhibits superior general capabilities compared to GPT-3.5.
arXiv Detail & Related papers (2024-01-05T09:58:09Z)
- PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing [64.53242758625922]
PanGu-Σ is trained on a cluster of Ascend 910 AI processors with the MindSpore framework.
It provides state-of-the-art performance in zero-shot learning of various Chinese NLP downstream tasks.
arXiv Detail & Related papers (2023-03-20T03:39:27Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
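The CALM entry above amounts to a confidence-based early-exit loop over decoder layers: easy generation timesteps leave the stack early, hard ones use every layer. The NumPy sketch below illustrates that general shape under assumed names (`early_exit_step`, a top-probability confidence measure, a fixed `threshold`); it is not CALM's actual implementation or its calibrated exit criterion.
```python
# Minimal sketch of confidence-based early exiting over decoder layers.
# Layer construction, confidence measure, and threshold are illustrative assumptions.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def early_exit_step(hidden, layers, unembed, threshold=0.9):
    """hidden: [d_model]; layers: callables [d_model] -> [d_model];
    unembed: [d_model, vocab]. Returns (token_id, layers_used)."""
    for i, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(hidden @ unembed)
        # Exit as soon as the top-token probability clears the threshold,
        # so "easy" timesteps spend fewer layers than "hard" ones.
        if probs.max() >= threshold:
            return int(probs.argmax()), i
    return int(probs.argmax()), len(layers)

rng = np.random.default_rng(0)
d, vocab, n_layers = 16, 100, 12
layers = [lambda h, W=rng.normal(size=(d, d)) * 0.2: np.tanh(h @ W)
          for _ in range(n_layers)]
unembed = rng.normal(size=(d, vocab))
token, used = early_exit_step(rng.normal(size=d), layers, unembed)
print(token, used)   # compute spent varies per timestep
```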
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models [648.3665819567409]
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale.
BIG-bench consists of 204 tasks, contributed by 450 authors across 132 institutions.
We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench.
arXiv Detail & Related papers (2022-06-09T17:05:34Z)
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
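The GroupBERT entry above pairs a convolutional module (local interactions) with self-attention (global interactions) inside a Transformer block. The sketch below shows that pairing in plain NumPy; the exact module ordering, residual structure, and names are illustrative assumptions rather than GroupBERT's definition.
```python
# Minimal sketch of a block combining depthwise convolution (local mixing)
# with self-attention (global mixing). Ordering and names are assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return scores @ v                                  # global token mixing

def depthwise_conv(x, kernel):
    """x: [seq, d_model]; kernel: [k, d_model]; 'same'-padded depthwise conv."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t:t + k] * kernel).sum(axis=0)    # local token mixing
    return out

def conv_attention_block(x, Wq, Wk, Wv, kernel):
    x = x + self_attention(x, Wq, Wk, Wv)   # global interactions
    x = x + depthwise_conv(x, kernel)       # local interactions
    return x

rng = np.random.default_rng(0)
seq, d = 10, 16
x = rng.normal(size=(seq, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
kernel = rng.normal(size=(3, d)) * 0.1
print(conv_attention_block(x, Wq, Wk, Wv, kernel).shape)
```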
- SML: a new Semantic Embedding Alignment Transformer for efficient cross-lingual Natural Language Inference [71.57324258813674]
The ability of Transformers to perform a variety of tasks, such as question answering, Natural Language Inference (NLI), and summarization, with precision has made them one of the best paradigms for addressing these kinds of tasks at present.
NLI is one of the best scenarios for testing these architectures, due to the knowledge required to understand complex sentences and establish a relation between a hypothesis and a premise.
In this paper, we propose a new architecture, a siamese multilingual transformer, to efficiently align multilingual embeddings for Natural Language Inference.
arXiv Detail & Related papers (2021-03-17T13:23:53Z)
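The siamese setup described in the SML entry above encodes premise and hypothesis with the same shared-weight encoder and compares their embeddings. The NumPy sketch below shows that general pattern; the encoder, pooling, and feature combination are illustrative assumptions, not SML's actual architecture.
```python
# Minimal sketch of a siamese (shared-weight) sentence encoder for NLI pairs.
# Encoder, pooling, and feature combination are illustrative assumptions.
import numpy as np

def encode(tokens, W_embed, W_enc):
    """tokens: [seq] int ids -> pooled sentence embedding [d_model]."""
    h = np.tanh(W_embed[tokens] @ W_enc)    # same encoder weights for both inputs
    return h.mean(axis=0)                   # mean pooling

def nli_features(premise, hypothesis, W_embed, W_enc):
    u = encode(premise, W_embed, W_enc)
    v = encode(hypothesis, W_embed, W_enc)
    # Classic siamese combination: concatenate u, v, and their interaction terms.
    return np.concatenate([u, v, np.abs(u - v), u * v])

rng = np.random.default_rng(0)
vocab, d = 1000, 32
W_embed = rng.normal(size=(vocab, d)) * 0.1
W_enc = rng.normal(size=(d, d)) * 0.1
feats = nli_features(rng.integers(0, vocab, size=12),
                     rng.integers(0, vocab, size=9), W_embed, W_enc)
print(feats.shape)   # features fed to an entailment/neutral/contradiction classifier
```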
This list is automatically generated from the titles and abstracts of the papers on this site.