Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained
Language Models
- URL: http://arxiv.org/abs/2203.01104v1
- Date: Wed, 2 Mar 2022 13:44:49 GMT
- Title: Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained
Language Models
- Authors: Ze-Feng Gao, Peiyu Liu, Wayne Xin Zhao, Zhong-Yi Lu, Ji-Rong Wen
- Abstract summary: We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics.
With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture.
Experiments on three well-known downstream natural language datasets based on GPT2 show improved performance and efficiency in increasing model capacity.
- Score: 68.9288651177564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The state-of-the-art Mixture-of-Experts (short as MoE) architecture has
achieved several remarkable successes in terms of increasing model capacity.
However, the widespread adoption of MoE has been hindered by its complexity,
communication costs, and training instability. Here we present a novel MoE
architecture based on matrix product operators (MPO) from quantum many-body
physics. It can decompose an original matrix into central tensors (containing
the core information) and auxiliary tensors (with only a small proportion of
parameters). With the decomposed MPO structure, we can reduce the parameters of
the original MoE architecture by sharing a global central tensor across experts
and keeping expert-specific auxiliary tensors. We also design a gradient mask
strategy for the tensor structure of MPO to alleviate the overfitting problem.
Experiments on three well-known downstream natural language datasets based
on GPT2 show improved performance and efficiency in increasing model capacity
(7.26x fewer parameters with the same number of experts). We additionally
demonstrate an improvement in the positive transfer effects of our approach for
multi-task learning.
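The parameter-sharing scheme in the abstract can be made concrete with a small sketch. The following is a minimal illustration, not the authors' implementation: the full MPO chain is replaced by a simplified three-factor stand-in in which one central tensor is shared by all experts and each expert keeps only small auxiliary input/output factors; the gradient-mask strategy is omitted, and all class, parameter, and dimension names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MPOSharedMoEFFN(nn.Module):
    """Sketch: experts share one central factor; auxiliary factors are per expert."""

    def __init__(self, d_model=768, d_hidden=3072, rank=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        # Shared central tensor: stored once, reused by every expert.
        self.central = nn.Parameter(torch.randn(rank, rank) * 0.02)
        # Expert-specific auxiliary tensors: small input/output factors per expert.
        self.aux_in = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.02)
        self.aux_out = nn.Parameter(torch.randn(n_experts, rank, d_hidden) * 0.02)
        self.proj_back = nn.Linear(d_hidden, d_model)

    def forward(self, x):  # x: (batch, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        top_w, top_i = gate.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx, w = top_i[:, slot], top_w[:, slot]
            # Rebuild each selected expert's transform from shared + auxiliary factors.
            h = torch.einsum("bd,bdr->br", x, self.aux_in[idx])
            h = h @ self.central
            h = torch.einsum("br,brh->bh", h, self.aux_out[idx])
            out = out + w.unsqueeze(-1) * self.proj_back(F.gelu(h))
        return out
```

As a quick sanity check, `MPOSharedMoEFFN()(torch.randn(4, 768))` returns a (4, 768) tensor; the point of the sketch is only that the central factor is allocated once regardless of the number of experts.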
Related papers
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate our method by pruning two state-of-the-art MoE models, Mixtral-8x7B and Mixtral-8x22B.
Our method outperforms other model pruning methods on a range of natural language tasks.
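A hedged illustration of the grouping-and-pruning idea above (not the paper's algorithm): similarity is assumed here to be cosine similarity between flattened expert weight matrices, and one representative expert is kept per group of near-duplicates; the function name and threshold are hypothetical.

```python
import torch
import torch.nn.functional as F


def prune_similar_experts(expert_weights, threshold=0.9):
    """expert_weights: list of 2-D tensors, one per expert. Returns indices to keep."""
    flat = torch.stack([w.flatten() for w in expert_weights])
    flat = F.normalize(flat, dim=-1)
    sim = flat @ flat.T  # pairwise cosine similarity between experts
    kept = []
    for i in range(len(expert_weights)):
        # Keep expert i only if it is not a near-duplicate of an already-kept expert.
        if all(sim[i, j] < threshold for j in kept):
            kept.append(i)
    return kept
```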
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in finetuned MoE models.
We theoretically prove that prioritizing the pruning of experts with a smaller change in the router's l2 norm from the pretrained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks on simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
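The pruning criterion in this summary translates into a few lines. This is a minimal sketch under the assumption that each expert has a single router weight vector and that experts whose router vector changed least (in l2 norm) from the pretrained checkpoint are pruned first; the function name is hypothetical.

```python
import torch


def experts_to_prune(router_pretrained, router_finetuned, n_prune):
    """Both inputs: (n_experts, d_model) router weight matrices."""
    change = (router_finetuned - router_pretrained).norm(dim=-1)  # l2 change per expert
    return torch.argsort(change)[:n_prune].tolist()  # smallest change pruned first
```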
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
- U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF [10.81723269312202]
Mixture-of-Experts (MoE) has been proposed as an energy-efficient path to larger and more capable language models.
We benchmark our proposed model on a large-scale inner-source dataset (160k hours).
arXiv Detail & Related papers (2024-04-25T08:34:21Z)
- Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers [1.1499643186017316]
We propose Cross-Architecture Transfer Learning (XATL) to improve the efficiency of Transformer Language Models.
XATL reduces training time by up to 2.5x and converges to a better minimum, with up to 2.6% stronger models on LM benchmarks within the same compute budget.
arXiv Detail & Related papers (2024-04-03T12:27:36Z)
- Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks [6.048370838631722]
We introduce Parameter-Efficient Sparsity Crafting (PESC), which transitions dense models to sparse models.
PESC integrates adapters into the MoE layers of sparse models, differentiating experts without altering individual weights within these layers.
Our sparse models, dubbed Camelidae, outperform all other open-source sparse models and exhibit superior general capabilities compared to GPT3.5.
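A hedged sketch of adapter-based expert differentiation as described above (not the PESC implementation): it assumes every expert reuses the same frozen dense FFN and trains only a small per-expert bottleneck adapter, so experts differ without altering the shared weights; names and sizes are hypothetical.

```python
import torch
import torch.nn as nn


class AdapterExpert(nn.Module):
    def __init__(self, shared_ffn: nn.Module, d_model=768, bottleneck=32):
        super().__init__()
        self.shared_ffn = shared_ffn  # frozen weights shared by all experts
        for p in self.shared_ffn.parameters():
            p.requires_grad = False
        self.adapter = nn.Sequential(  # trainable, expert-specific adapter
            nn.Linear(d_model, bottleneck), nn.GELU(), nn.Linear(bottleneck, d_model)
        )

    def forward(self, x):
        h = self.shared_ffn(x)
        return h + self.adapter(h)  # residual adapter on top of the shared output
```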
arXiv Detail & Related papers (2024-01-05T09:58:09Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers to reduce the model size.
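The cross-layer sharing idea can be sketched the same way as the expert-sharing variant above. Illustration only, assuming a simplified three-factor stand-in for the MPO decomposition: one central matrix is allocated once and reused by every layer, while each layer keeps its own small auxiliary factors; all names are hypothetical.

```python
import torch
import torch.nn as nn


class SharedCentralStack(nn.Module):
    def __init__(self, n_layers=12, d_model=768, rank=64):
        super().__init__()
        self.central = nn.Parameter(torch.randn(rank, rank) * 0.02)  # stored once
        self.aux_in = nn.ParameterList(
            [nn.Parameter(torch.randn(d_model, rank) * 0.02) for _ in range(n_layers)])
        self.aux_out = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d_model) * 0.02) for _ in range(n_layers)])

    def forward(self, x):  # x: (batch, d_model)
        for a_in, a_out in zip(self.aux_in, self.aux_out):
            # Each layer rebuilds its transform from the shared central factor.
            x = x + torch.relu(x @ a_in @ self.central) @ a_out
        return x
```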
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- Improving Expert Specialization in Mixture of Experts [0.7366405857677227]
Mixture of experts (MoE) is the simplest gated modular neural network architecture.
We show that the original MoE architecture and its training method do not guarantee intuitive task decompositions and good expert utilization.
We introduce a novel gating architecture, similar to attention, that improves performance and results in a lower entropy task decomposition.
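A hedged sketch of what an attention-style gate could look like (the paper's exact gating architecture is not reproduced here): gating scores are assumed to be scaled dot products between a projected input and learned per-expert key embeddings, followed by a softmax; names and dimensions are hypothetical.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGate(nn.Module):
    def __init__(self, d_model=256, d_key=64, n_experts=8):
        super().__init__()
        self.query = nn.Linear(d_model, d_key)
        self.expert_keys = nn.Parameter(torch.randn(n_experts, d_key) * 0.02)

    def forward(self, x):                            # x: (batch, d_model)
        scores = self.query(x) @ self.expert_keys.T  # (batch, n_experts)
        return F.softmax(scores / math.sqrt(self.expert_keys.shape[-1]), dim=-1)
```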
arXiv Detail & Related papers (2023-02-28T16:16:45Z)
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
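A hedged sketch of pairing a convolutional module with self-attention to decouple local and global interactions, as described above (illustration only, not the GroupBERT implementation): the local path is assumed to be a depthwise 1-D convolution and the global path standard multi-head attention, each with its own residual connection; names are hypothetical.

```python
import torch
import torch.nn as nn


class ConvAugmentedLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=4, kernel_size=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)  # depthwise
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                       # x: (batch, seq, d_model)
        h = self.norm1(x)
        x = x + self.conv(h.transpose(1, 2)).transpose(1, 2)    # local interactions
        h = self.norm2(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]       # global interactions
        return x
```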
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.