Revisiting Single-gated Mixtures of Experts
- URL: http://arxiv.org/abs/2304.05497v1
- Date: Tue, 11 Apr 2023 21:07:59 GMT
- Title: Revisiting Single-gated Mixtures of Experts
- Authors: Amelie Royer, Ilia Karmanov, Andrii Skliar, Babak Ehteshami Bejnordi,
Tijmen Blankevoort
- Abstract summary: We propose to revisit the simple single-gate MoE, which allows for more practical training.
Key to our work are (i) a base model branch acting both as an early exit and as an ensembling regularization scheme, (ii) a simple and efficient asynchronous training pipeline, and (iii) a per-sample clustering-based initialization.
We show experimentally that the proposed model obtains efficiency-to-accuracy trade-offs comparable with other, more complex MoEs.
- Score: 13.591354795556972
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture of Experts (MoE) models are rising in popularity as a means to train
extremely large-scale models while keeping the computational cost at inference
time reasonable. Recent state-of-the-art approaches usually assume a large
number of experts and require training all experts jointly, which often leads
to training instabilities such as router collapse. In contrast, in this
work, we propose to revisit the simple single-gate MoE, which allows for more
practical training. Key to our work are (i) a base model branch acting both as
an early-exit and an ensembling regularization scheme, (ii) a simple and
efficient asynchronous training pipeline without router collapse issues, and
finally (iii) a per-sample clustering-based initialization. We show
experimentally that the proposed model obtains efficiency-to-accuracy
trade-offs comparable with other, more complex MoEs, and outperforms non-mixture
baselines. This showcases the merits of even a simple single-gate MoE, and
motivates further exploration in this area.
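As a rough illustration of the architecture described in the abstract, the sketch below shows one way a single-gate MoE with a base model branch could be wired up: the base head doubles as a confidence-based early exit and is averaged with the selected expert's output as a simple ensembling regularizer. The module names, top-1 routing, 0.5 ensembling weight, and confidence threshold are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleGateMoE(nn.Module):
    """Sketch of a single-gate MoE: one router at a single point in the network
    selects an expert branch, while a shared base branch provides both an
    early-exit prediction and an ensembling signal."""

    def __init__(self, dim, num_classes, num_experts, exit_threshold=0.9):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)            # shared feature extractor (placeholder)
        self.base_head = nn.Linear(dim, num_classes)   # base branch: cheap early-exit head
        self.router = nn.Linear(dim, num_experts)      # the single gate over experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))
            for _ in range(num_experts)
        )
        self.exit_threshold = exit_threshold

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        base_logits = self.base_head(h)

        # Early exit (simplified to whole-batch): if the base branch is already
        # confident, skip the expert branches entirely.
        if not self.training:
            conf = F.softmax(base_logits, dim=-1).max(dim=-1).values
            if bool((conf > self.exit_threshold).all()):
                return base_logits

        # Single gate: route each sample to its top-1 expert.
        gate = F.softmax(self.router(h.detach()), dim=-1)
        expert_idx = gate.argmax(dim=-1)
        expert_logits = torch.stack(
            [self.experts[int(i)](h[n]) for n, i in enumerate(expert_idx)]
        )

        # Averaging the base branch with the selected expert acts as an
        # ensembling regularizer.
        return 0.5 * (base_logits + expert_logits)
```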
Related papers
- A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in finetuned MoE models.
We theoretically prove that prioritizing for pruning the experts whose router's l2 norm changes least from the pretrained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks with a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
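A minimal sketch of the pruning criterion described above, assuming the router is a linear layer with one weight row per expert: experts whose row changed least (in l2 norm) between the pretrained and fine-tuned models are pruned first. Array names and shapes are assumptions for illustration.

```python
import numpy as np

def prune_experts_by_router_shift(router_w_pretrained, router_w_finetuned, num_keep):
    """Keep the experts whose router rows changed the most during fine-tuning.

    router_w_* : arrays of shape (num_experts, hidden_dim), one row per expert.
    Experts whose row moved least (smallest l2 change) are pruned first.
    """
    shift = np.linalg.norm(router_w_finetuned - router_w_pretrained, axis=1)
    keep = np.argsort(shift)[::-1][:num_keep]                 # largest shifts kept
    pruned = np.argsort(shift)[: len(shift) - num_keep]       # smallest shifts pruned
    return np.sort(keep), np.sort(pruned)

# Example: 8 experts, keep the 4 whose routing weights changed the most.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(8, 16))
w1 = w0 + rng.normal(scale=rng.uniform(0.01, 1.0, size=(8, 1)), size=(8, 16))
keep, pruned = prune_experts_by_router_shift(w0, w1, num_keep=4)
print("keep experts:", keep, "prune experts:", pruned)
```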
- SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss.
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
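A small sketch of what the first (pruning) stage could look like, approximating the heavy-hitters counting guidance with simple routing-frequency counts over a calibration set; the top-k statistic and the number of experts kept are illustrative assumptions.

```python
import numpy as np

def expert_usage_counts(gate_logits, top_k=2):
    """Count how often each expert appears in the per-token top-k routing choice.

    gate_logits : (num_tokens, num_experts) router scores from a calibration set.
    Returns a usage count per expert; the "heavy hitters" are the most used ones.
    """
    num_experts = gate_logits.shape[1]
    topk = np.argsort(-gate_logits, axis=1)[:, :top_k]
    return np.bincount(topk.ravel(), minlength=num_experts)

# Stage 1: keep only the heavy-hitter experts; stage 2 (not shown) fine-tunes the
# pruned model with a regularizer to recover the accuracy loss.
rng = np.random.default_rng(0)
logits = rng.normal(size=(10_000, 16))
counts = expert_usage_counts(logits, top_k=2)
keep = np.argsort(-counts)[:8]                  # keep the 8 most-used experts
print("expert usage:", counts)
print("kept experts:", np.sort(keep))
```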
- Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference [3.217776693788795]
We propose a lightweight optimization technique called ExFlow that substantially accelerates inference of pre-trained MoE models.
By exploiting the inter-layer expert affinity, our solution can be directly applied to pre-trained MoE models without any fine-tuning or accuracy degradation.
Our solution beats cutting-edge MoE implementations with 8 to 64 experts, with up to 2.2x improvement in inference throughput.
arXiv Detail & Related papers (2024-01-16T14:16:47Z)
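One plausible reading of inter-layer expert affinity is a co-occurrence count of routing decisions in consecutive MoE layers, which can then guide expert placement across devices; the sketch below uses that formulation as an assumption, not necessarily ExFlow's exact definition.

```python
import numpy as np

def interlayer_affinity(routes_layer_a, routes_layer_b, num_experts):
    """Estimate inter-layer expert affinity as a co-occurrence matrix.

    routes_layer_a / routes_layer_b : (num_tokens,) expert index chosen per token
    in two consecutive MoE layers. affinity[i, j] counts tokens routed to expert i
    in layer a and expert j in layer b; co-locating high-affinity pairs on the
    same device reduces cross-device token traffic at inference time.
    """
    affinity = np.zeros((num_experts, num_experts), dtype=np.int64)
    np.add.at(affinity, (routes_layer_a, routes_layer_b), 1)
    return affinity

rng = np.random.default_rng(0)
a = rng.integers(0, 8, size=50_000)
b = (a + rng.integers(0, 2, size=50_000)) % 8   # correlated routing across layers
print(interlayer_affinity(a, b, num_experts=8))
```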
- Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders [63.28408887247742]
We study whether training procedures can be improved to yield better generalization capabilities in the resulting models.
We recommend a simple recipe for training dense encoders: train on MSMARCO with parameter-efficient methods such as LoRA, and use in-batch negatives unless well-constructed hard negatives are available.
arXiv Detail & Related papers (2023-11-16T10:42:58Z)
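A minimal sketch of the in-batch-negatives part of the recipe, where every other passage in the batch serves as a negative for a given query; the temperature, embedding dimension, and the omission of the LoRA adapters are illustrative simplifications.

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(query_emb, passage_emb, temperature=0.05):
    """Contrastive loss with in-batch negatives for a dense retriever.

    query_emb, passage_emb : (batch, dim) embeddings where passage i is the
    positive for query i; every other passage in the batch is a negative.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    scores = q @ p.T / temperature           # (batch, batch) similarity matrix
    labels = torch.arange(scores.size(0))    # diagonal entries are the positives
    return F.cross_entropy(scores, labels)

q = torch.randn(32, 768)
p = torch.randn(32, 768)
print(in_batch_negatives_loss(q, p))
```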
- Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
The Mixture-of-Experts (MoE) model is powerful for large-scale pre-training, but is hard to deploy in cloud or mobile environments.
We propose a general method to progressively drop the non-professional experts for the target downstream task.
arXiv Detail & Related papers (2022-06-01T07:09:01Z)
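A rough sketch of progressive expert dropping, assuming experts are ranked by how often the downstream task routes tokens to them; the utilization measure and the one-expert-per-step schedule are assumptions, not the paper's exact procedure.

```python
import numpy as np

def progressive_expert_drop(utilization, active, drop_per_step=1, min_experts=1):
    """Drop the least-utilized active experts for a target downstream task.

    utilization : (num_experts,) fraction of downstream tokens routed to each expert.
    active      : boolean mask of experts still kept.
    Called every few fine-tuning steps, this progressively shrinks the MoE down to
    the experts that matter for the task.
    """
    active = active.copy()
    candidates = np.where(active)[0]
    order = candidates[np.argsort(utilization[candidates])]   # least-used first
    for idx in order[:drop_per_step]:
        if active.sum() > min_experts:
            active[idx] = False
    return active

active = np.ones(8, dtype=bool)
util = np.array([0.30, 0.02, 0.25, 0.01, 0.20, 0.05, 0.15, 0.02])
for step in range(4):
    active = progressive_expert_drop(util, active, drop_per_step=1)
print("experts kept for the downstream task:", np.where(active)[0])
```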
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
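A toy sketch of a DRO-style objective with a parametric adversary that outputs per-example likelihood-ratio weights; the exponentiated-score parameterization and the batch normalization of the weights are assumptions for illustration, and the adversary's own (maximizing) update is omitted.

```python
import torch
import torch.nn as nn

class ParametricAdversary(nn.Module):
    """Small model producing a normalized likelihood-ratio weight per example,
    used to up-weight hard subpopulations in a DRO-style objective."""

    def __init__(self, feature_dim):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, features):
        # exp(score) acts as an unnormalized likelihood ratio; normalize so the
        # weights average to 1 over the batch.
        w = torch.exp(self.score(features)).squeeze(-1)
        return w * w.numel() / w.sum()

def dro_loss(per_example_loss, weights):
    """Adversarially re-weighted loss: the model minimizes it, while the
    adversary's parameters are updated to maximize it (update not shown)."""
    return (weights * per_example_loss).mean()

features = torch.randn(64, 10)
per_example_loss = torch.rand(64)
adv = ParametricAdversary(10)
print(dro_loss(per_example_loss, adv(features)))
```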
- EBJR: Energy-Based Joint Reasoning for Adaptive Inference [10.447353952054492]
State-of-the-art deep learning models have achieved significant performance levels on various benchmarks.
Light-weight architectures, on the other hand, achieve moderate accuracies, but at a much more desirable latency.
This paper presents a new method for jointly using large, accurate models together with small, fast ones.
arXiv Detail & Related papers (2021-10-20T02:33:31Z)
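A sketch of the general adaptive-inference pattern this points at: run the small model first and defer to the large one only when an energy score signals low confidence. The negative-logsumexp energy and the threshold are assumptions, not necessarily EBJR's exact formulation.

```python
import torch

def adaptive_predict(small_model, large_model, x, energy_threshold=-4.0):
    """Run the cheap model first; defer to the large model only for inputs whose
    energy score suggests the small model is unsure.

    Energy is taken here as -logsumexp over the small model's logits
    (lower energy = more confident), an illustrative choice.
    """
    small_logits = small_model(x)
    energy = -torch.logsumexp(small_logits, dim=-1)
    defer = energy > energy_threshold                 # high energy -> uncertain
    preds = small_logits.argmax(dim=-1)
    if defer.any():
        preds[defer] = large_model(x[defer]).argmax(dim=-1)
    return preds, defer

small = torch.nn.Linear(16, 10)
large = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
x = torch.randn(8, 16)
preds, deferred = adaptive_predict(small, large, x)
print(preds, "deferred to large model:", deferred.sum().item())
```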
- Pool of Experts: Realtime Querying Specialized Knowledge in Massive Neural Networks [0.20305676256390928]
This paper proposes a framework, called Pool of Experts (PoE), that instantly builds a lightweight and task-specific model without any training process.
For a realtime model querying service, PoE first extracts a pool of primitive components, called experts, from a well-trained and sufficiently generic network.
PoE can build a fairly accurate yet compact model in realtime, whereas other training-based methods take a few minutes per query to reach a similar level of accuracy.
arXiv Detail & Related papers (2021-07-03T06:31:54Z)
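A deliberately simplified sketch of the querying idea: with a pool of pre-extracted expert components keyed by the primitives they handle, a task-specific model is assembled instantly by selecting and stacking the relevant experts, with no training. Representing each expert as a single linear head is an assumption for illustration.

```python
import numpy as np

# An assumed, simplified pool: each "expert" is a classifier head (weight vector)
# specialized for one primitive concept, pre-extracted from a generic network.
rng = np.random.default_rng(0)
FEATURE_DIM = 128
expert_pool = {concept: rng.normal(size=(FEATURE_DIM,))
               for concept in ["cat", "dog", "car", "truck", "bird", "plane"]}

def build_task_model(pool, task_concepts):
    """Assemble a task-specific linear head instantly, with no training,
    by stacking the pooled experts relevant to the queried task."""
    weights = np.stack([pool[c] for c in task_concepts])      # (num_classes, dim)
    def model(features):                                       # features: (batch, dim)
        return features @ weights.T                            # task logits
    return model

# Realtime query: a model that only distinguishes the classes the user asked for.
task_model = build_task_model(expert_pool, ["cat", "dog", "bird"])
print(task_model(rng.normal(size=(4, FEATURE_DIM))).shape)     # (4, 3)
```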
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
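A minimal sketch of mixup-style distillation: the student matches the teacher's soft predictions on convex combinations of example pairs, stretching limited training data. Interpolating in embedding space, the Beta(0.4, 0.4) mixing, and the KL objective are illustrative assumptions rather than MixKD's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mixkd_step(student, teacher, emb_a, emb_b, alpha=0.4):
    """One distillation step on mixup-interpolated inputs.

    emb_a, emb_b : (batch, dim) input embeddings of two example batches.
    The student is trained to match the teacher's soft predictions on the
    interpolated examples, which augments the (possibly limited) training data.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed = lam * emb_a + (1.0 - lam) * emb_b
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(mixed), dim=-1)
    student_log_probs = F.log_softmax(student(mixed), dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

teacher = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 4))
student = nn.Linear(32, 4)
loss = mixkd_step(student, teacher, torch.randn(16, 32), torch.randn(16, 32))
loss.backward()
print(float(loss))
```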
This list is automatically generated from the titles and abstracts of the papers on this site.