Residual Mixture of Experts
- URL: http://arxiv.org/abs/2204.09636v1
- Date: Wed, 20 Apr 2022 17:29:48 GMT
- Title: Residual Mixture of Experts
- Authors: Lemeng Wu, Mengchen Liu, Yinpeng Chen, Dongdong Chen, Xiyang Dai, Lu
Yuan
- Abstract summary: Residual Mixture of Experts (RMoE) is an efficient training pipeline for MoE vision transformers on downstream tasks.
RMoE achieves comparable results with the upper-bound MoE training, while only introducing minor additional training cost.
- Score: 75.5489156421442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture of Experts (MoE) is able to scale up vision transformers effectively.
However, it requires prohibitive computational resources to train a large MoE
transformer. In this paper, we propose Residual Mixture of Experts (RMoE), an
efficient training pipeline for MoE vision transformers on downstream tasks,
such as segmentation and detection. RMoE achieves comparable results with the
upper-bound MoE training, while introducing only minor additional training cost
compared with the lower-bound non-MoE training pipelines. The efficiency is supported by
our key observation: the weights of an MoE transformer can be factored into an
input-independent core and an input-dependent residual. Compared with the
weight core, the weight residual can be trained efficiently with much less
computational resource, e.g., by finetuning on the downstream data. We show that,
compared with the current MoE training pipeline, we get comparable results
while saving over 30% of the training cost. Compared with state-of-the-art
non-MoE transformers, such as Swin-T / CvT-13 / Swin-L, we obtain +1.1 / +0.9 / +1.0
mIoU gains on ADE20K segmentation and +1.4 / +1.6 / +0.6 AP gains on the MS-COCO
object detection task with less than 3% additional training cost.
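The factorization above is simple to state in code. Below is a minimal PyTorch sketch, not the authors' implementation: a hypothetical `ResidualExpertLinear` layer whose per-expert weight is a frozen, input-independent core plus a trainable per-expert residual; the router and the rest of the transformer are omitted.

```python
# Minimal sketch (not the authors' code) of the RMoE weight factorization:
# each expert's weight = a shared, input-independent core + a per-expert residual.
import torch
import torch.nn as nn


class ResidualExpertLinear(nn.Module):
    """Hypothetical layer name; illustrates the core + residual split only."""

    def __init__(self, dim_in, dim_out, num_experts):
        super().__init__()
        # Input-independent core, e.g. inherited from a pretrained non-MoE checkpoint and frozen.
        self.core = nn.Parameter(torch.empty(dim_out, dim_in), requires_grad=False)
        nn.init.xavier_uniform_(self.core)
        # Input-dependent per-expert residuals, initialized to zero and trained downstream.
        self.residual = nn.Parameter(torch.zeros(num_experts, dim_out, dim_in))

    def forward(self, x, expert_idx):
        # x: (tokens, dim_in); expert_idx: (tokens,) expert id per token from a router (not shown).
        w = self.core.unsqueeze(0) + self.residual[expert_idx]  # (tokens, dim_out, dim_in)
        return torch.einsum("tod,td->to", w, x)
```

In this sketch only `residual` (and, in a full pipeline, the router) receives gradients on the downstream task, which is where the claimed training savings come from; which layers are factored and how the core is obtained are details left to the paper.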
Related papers
- Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining [32.925150708409205]
Mixed Sparsity Training (MST) is an efficient pretraining method that can reduce about 75% of Floating Point Operations (FLOPs) while maintaining performance.
Our experiment on GPT-2 showcases a FLOP reduction of 4$\times$ without compromising performance.
arXiv Detail & Related papers (2024-08-21T16:13:16Z) - Toward Inference-optimal Mixture-of-Expert Large Language Models [55.96674056805708]
We study the scaling law of MoE-based large language models (LLMs).
We find that MoEs with a few (4/8) experts are the most serving-efficient solution under the same performance, but cost 2.5-3.5x more in training.
We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss.
arXiv Detail & Related papers (2024-04-03T16:33:42Z) - PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation [61.57833648734164]
We propose a novel Parallel Yielding Re-Activation (PYRA) method for training-inference efficient task adaptation.
PYRA outperforms all competing methods under both low and high compression rates.
arXiv Detail & Related papers (2024-03-14T09:06:49Z) - Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of Adapters [11.05223262950967]
Mixture of Experts (MoE) architectures have recently started burgeoning due to their ability to scale a model's capacity while keeping the computational cost affordable.
This paper attempts to demystify the use of MoE for parameter-efficient fine-tuning of Audio Spectrogram Transformers to audio and speech downstream tasks.
It exploits adapters as the experts and, leveraging the recent Soft MoE method, relies on a soft assignment between the input tokens and the experts to keep the computation time limited.
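As a rough illustration of that soft assignment (a sketch under assumptions, not the paper's code): each expert processes a convex combination of the input tokens, and each token receives a convex combination of the expert outputs, with small bottleneck adapters standing in as the experts. The name `SoftMoEAdapters` is hypothetical, and one slot per expert is assumed for brevity.

```python
# Illustrative Soft MoE-style routing with adapters as experts (one slot per expert).
import torch
import torch.nn as nn


class SoftMoEAdapters(nn.Module):
    def __init__(self, dim, num_experts, bottleneck=16):
        super().__init__()
        self.phi = nn.Parameter(torch.randn(dim, num_experts) * dim ** -0.5)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                         # x: (batch, tokens, dim)
        logits = x @ self.phi                     # (batch, tokens, experts)
        dispatch = logits.softmax(dim=1)          # convex combination over tokens, per expert
        combine = logits.softmax(dim=2)           # convex combination over experts, per token
        slots = torch.einsum("btd,bte->bed", x, dispatch)
        outs = torch.stack([exp(slots[:, i]) for i, exp in enumerate(self.experts)], dim=1)
        return torch.einsum("bte,bed->btd", combine, outs)
```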
arXiv Detail & Related papers (2024-02-01T18:16:04Z) - Experts Weights Averaging: A New General Training Scheme for Vision
Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT for inference.
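The convert-back step is easy to picture; here is a hedged sketch that assumes structurally identical expert FFNs and plain uniform averaging of their parameters (the function name is illustrative, and the paper may use a more refined averaging scheme).

```python
# Illustrative fold-back: average the experts of an MoE layer into one FFN for inference.
import copy
import torch
import torch.nn as nn


def average_experts_into_ffn(experts: nn.ModuleList) -> nn.Module:
    """Average structurally identical expert FFNs into a single FFN (uniform average assumed)."""
    merged = copy.deepcopy(experts[0])
    with torch.no_grad():
        avg_state = {
            name: torch.stack([e.state_dict()[name] for e in experts]).mean(dim=0)
            for name in merged.state_dict()
        }
        merged.load_state_dict(avg_state)
    return merged  # drop-in FFN replacement for the MoE layer at inference time
```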
arXiv Detail & Related papers (2023-08-11T12:05:12Z) - EquiformerV2: Improved Equivariant Transformer for Scaling to
Higher-Degree Representations [9.718771797861908]
We propose EquiformerV2, which outperforms previous state-of-the-art methods on the large-scale OC20 dataset by up to 9% on forces.
We also compare EquiformerV2 with Equiformer on QM9 and OC20 S2EF-2M datasets to better understand the performance gain brought by higher degrees.
arXiv Detail & Related papers (2023-06-21T07:01:38Z) - Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable
Transformers [107.3726071306935]
We propose a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers to better accuracy in their full capacity without collapse.
SMoE-Dropout consists of a randomly initialized and fixed router network that activates experts, gradually increasing the number of activated experts as training progresses.
Our experiments demonstrate the superior performance and substantial computation savings of SMoE-Dropout, compared to dense training baselines with equivalent parameter counts.
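A minimal sketch of the two ingredients described above, a randomly initialized router that is frozen and a top-k that grows with training progress; the linear-growth schedule and the names are assumptions, not the paper's recipe.

```python
# Illustrative SMoE-Dropout-style routing: frozen random router, growing number of active experts.
import torch
import torch.nn as nn


class FixedRandomRouter(nn.Module):
    def __init__(self, dim, num_experts):
        super().__init__()
        self.proj = nn.Linear(dim, num_experts, bias=False)
        for p in self.proj.parameters():          # random init, never trained
            p.requires_grad_(False)
        self.num_experts = num_experts

    def active_k(self, progress, k_min=2):
        # Assumed schedule: linearly grow the activated-expert count from k_min to all experts.
        return min(self.num_experts, k_min + int(progress * (self.num_experts - k_min)))

    def forward(self, x, progress):               # x: (tokens, dim); progress in [0, 1]
        k = self.active_k(progress)
        scores = self.proj(x)                     # (tokens, num_experts)
        topk_val, topk_idx = scores.topk(k, dim=-1)
        gates = torch.zeros_like(scores).scatter_(-1, topk_idx, topk_val.softmax(dim=-1))
        return gates                              # routing weights; expert dispatch not shown
```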
arXiv Detail & Related papers (2023-03-02T22:12:51Z) - Beyond Distillation: Task-level Mixture-of-Experts for Efficient
Inference [17.97893143555333]
Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation.
In this work, we investigate routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation.
Experiments on WMT and a web-scale dataset suggest that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models.
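A hedged sketch of what task-level routing means in code, assuming a hard expert choice per task id for the deployment view described above (the paper's training-time gating is more involved; all names here are illustrative).

```python
# Illustrative task-level routing: one expert assignment per task id, shared by all its tokens.
import torch
import torch.nn as nn


class TaskLevelMoE(nn.Module):
    def __init__(self, dim, num_experts, num_tasks):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.task_gate = nn.Embedding(num_tasks, num_experts)  # gate logits per task

    def forward(self, x, task_id):                # x: (tokens, dim); task_id: int
        # Hard choice shown for the deployment view; training would use a differentiable gate.
        expert = int(self.task_gate.weight[task_id].argmax())
        return self.experts[expert](x)
```

Because every token of a task takes the same path, the selected experts can be extracted as a smaller, ready-to-deploy sub-network, which is the point made in the summary above.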
arXiv Detail & Related papers (2021-09-24T20:42:16Z) - Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize training in the early stage and unleash the model's full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.