MoEfication: Conditional Computation of Transformer Models for Efficient
Inference
- URL: http://arxiv.org/abs/2110.01786v1
- Date: Tue, 5 Oct 2021 02:14:38 GMT
- Title: MoEfication: Conditional Computation of Transformer Models for Efficient
Inference
- Authors: Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, Jie
Zhou
- Abstract summary: Transformer-based pre-trained language models can achieve superior performance on most NLP tasks thanks to their large parameter capacity, but this capacity also leads to huge computation costs.
We explore accelerating large-model inference with conditional computation based on the sparse-activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version of equal model size, a process we call MoEfication.
- Score: 66.56994436947441
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based pre-trained language models can achieve superior
performance on most NLP tasks thanks to their large parameter capacity, but this
capacity also leads to huge computation costs. Fortunately, we find through
empirical study that most inputs activate only a tiny fraction of the neurons
during inference. Hence, we explore accelerating large-model inference with
conditional computation based on this sparse-activation phenomenon. We propose
to transform a large model into its mixture-of-experts (MoE) version of equal
model size, a process we call MoEfication. MoEfication consists of two steps:
(1) splitting the parameters of the feed-forward neural networks (FFNs) into
multiple parts as experts, and (2) building expert routers that decide which
experts to use for each input. To further improve the performance of MoEfied
models, we can also fine-tune them on downstream tasks, a step we call parameter
calibration. Experimental results show that MoEfied models can significantly
reduce computation cost, e.g., activating only 20% of the FFN parameters of a
700-million-parameter model without performance degradation on several
downstream tasks, including text classification and reading comprehension.
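The two steps above map naturally onto code. Below is a minimal PyTorch sketch of the idea for a standard two-layer ReLU FFN. The helper names (split_ffn_into_experts, MoEfiedFFN), the contiguous equal-sized neuron partition, and the learned top-k router are illustrative assumptions for this sketch, not the authors' released implementation, whose actual splitting and routing strategies may differ.

```python
import torch
import torch.nn as nn


def split_ffn_into_experts(ffn_in: nn.Linear, ffn_out: nn.Linear, num_experts: int) -> nn.ModuleList:
    """Step (1): partition the FFN's intermediate neurons into expert groups.

    A contiguous, equal-sized partition is used here for simplicity; the
    original FFN's output bias is dropped in this sketch.
    """
    d_ff = ffn_in.out_features
    assert d_ff % num_experts == 0, "d_ff must be divisible by num_experts"
    size = d_ff // num_experts
    experts = []
    for e in range(num_experts):
        idx = torch.arange(e * size, (e + 1) * size)
        w_in = nn.Linear(ffn_in.in_features, size)
        w_out = nn.Linear(size, ffn_out.out_features, bias=False)
        w_in.weight.data.copy_(ffn_in.weight.data[idx])
        w_in.bias.data.copy_(ffn_in.bias.data[idx])
        w_out.weight.data.copy_(ffn_out.weight.data[:, idx])
        experts.append(nn.Sequential(w_in, nn.ReLU(), w_out))
    return nn.ModuleList(experts)


class MoEfiedFFN(nn.Module):
    """Step (2): a learned router selects the top-k experts for each token."""

    def __init__(self, experts: nn.ModuleList, d_model: int, top_k: int = 2):
        super().__init__()
        self.experts = experts
        self.router = nn.Linear(d_model, len(experts))  # one score per expert
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); only the selected experts are evaluated.
        scores = self.router(x)                            # (num_tokens, num_experts)
        top_idx = scores.topk(self.top_k, dim=-1).indices  # experts chosen per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (top_idx == e).any(dim=-1)              # tokens routed to expert e
            if mask.any():
                out[mask] += expert(x[mask])
        return out


# Usage on a toy dense FFN (hypothetical sizes):
dense_in, dense_out = nn.Linear(768, 3072), nn.Linear(3072, 768)
moe_ffn = MoEfiedFFN(split_ffn_into_experts(dense_in, dense_out, num_experts=16),
                     d_model=768, top_k=4)
tokens = torch.randn(10, 768)
print(moe_ffn(tokens).shape)  # torch.Size([10, 768])
```

If all experts were selected, this would reproduce the dense FFN (up to the dropped output bias); with top_k well below num_experts, only a fraction of the FFN parameters is touched per token, which is where the reported computation savings come from. The abstract's parameter-calibration step then corresponds to fine-tuning the MoEfied model on the downstream task.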
Related papers
- Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Experts (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce UNCURL, an adaptive task-aware pruning technique that reduces the number of experts per MoE layer offline, after training.
arXiv Detail & Related papers (2024-09-02T22:35:03Z)
- Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things [11.802983172874901]
The implementation of machine learning in Internet of Things devices poses significant operational challenges due to limited energy and computation resources.
We present adaptive resolution inference (ARI), a novel approach that enables evaluating new trade-offs between energy dissipation and model performance.
arXiv Detail & Related papers (2024-08-26T16:00:26Z)
- Towards Efficient Pareto Set Approximation via Mixture of Experts Based Model Fusion [53.33473557562837]
Solving multi-objective optimization problems for large deep neural networks is a challenging task due to the complexity of the loss landscape and the expensive computational cost.
We propose a practical and scalable approach to solve this problem via mixture of experts (MoE) based model fusion.
By ensembling the weights of specialized single-task models, the MoE module can effectively capture the trade-offs between multiple objectives.
arXiv Detail & Related papers (2024-06-14T07:16:18Z)
- XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection [30.687511115573038]
XMoE is a novel MoE method designed to enhance both the efficacy and efficiency of sparse MoE models.
XMoE decreases the computation load at MoE layers by over 50% without sacrificing model performance.
arXiv Detail & Related papers (2024-02-27T08:18:02Z)
- Pruning Large Language Models via Accuracy Predictor [0.0]
Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks.
We propose a novel pruning approach: first, a training set of architecture-accuracy pairs is established, and then a non-neural model is trained as an accuracy predictor.
arXiv Detail & Related papers (2023-09-18T06:38:24Z)
- Understanding Parameter Sharing in Transformers [53.75988363281843]
Previous work on Transformers has focused on sharing parameters in different layers, which can improve the performance of models with limited parameters by increasing model depth.
We show that the success of this approach can be largely attributed to better convergence, with only a small part due to the increased model complexity.
Experiments on 8 machine translation tasks show that our model achieves competitive performance with only half the model complexity of parameter sharing models.
arXiv Detail & Related papers (2023-06-15T10:48:59Z)
- Double and Single Descent in Causal Inference with an Application to High-Dimensional Synthetic Control [2.3173485093942943]
In machine learning, there may be so many free parameters that the model fits the training data perfectly.
We document the performance of high-dimensional synthetic control estimators with many control units.
We find that adding control units can help improve imputation performance even beyond the point where the pre-treatment fit is perfect.
arXiv Detail & Related papers (2023-05-01T07:54:53Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)