SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture
of Experts
- URL: http://arxiv.org/abs/2105.03036v1
- Date: Fri, 7 May 2021 02:38:23 GMT
- Title: SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture
of Experts
- Authors: Zhao You, Shulin Feng, Dan Su and Dong Yu
- Abstract summary: Mixture of Experts (MoE) based Transformer has shown promising results in many domains.
In this work, we explore the MoE based model for speech recognition, named SpeechMoE.
A new router architecture is used in SpeechMoE that can simultaneously utilize information from a shared embedding network.
- Score: 29.582683923988203
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recently, the Mixture of Experts (MoE) based Transformer has shown promising
results in many domains. This is largely due to two advantages of the
architecture: first, an MoE based Transformer can increase model capacity
without increasing computational cost at either training or inference time;
second, it is a dynamic network that can adapt to the varying complexity of
input instances in real-world applications. In this work,
we explore the MoE based model for speech recognition, named SpeechMoE. To
further control the sparsity of router activation and improve the diversity of
gate values, we propose a sparsity L1 loss and a mean importance loss
respectively. In addition, a new router architecture is used in SpeechMoE which
can simultaneously utilize the information from a shared embedding network and
the hierarchical representation of different MoE layers. Experimental results
show that SpeechMoE achieves a lower character error rate (CER) than
traditional static networks at comparable computation cost, providing
7.0%-23.0% relative CER improvements on four evaluation datasets.
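The technically central ideas in the abstract are the router input and the two auxiliary losses, so a small illustration may help. Below is a minimal, hypothetical PyTorch sketch of a SpeechMoE-style layer: the router concatenates the layer's hidden representation (standing in here for the hierarchical representation) with the shared embedding network's output, routing is top-1, and the sparsity L1 loss and mean importance loss are given plausible illustrative forms. Layer shapes, loss definitions, and normalizations are assumptions for exposition, not the paper's implementation.

```python
# Hypothetical sketch of a SpeechMoE-style MoE layer. Assumptions: softmax
# router over [hidden state ; shared embedding], top-1 expert selection,
# feed-forward experts, illustrative forms for the two auxiliary losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechMoELayer(nn.Module):
    def __init__(self, d_model: int, d_embed: int, n_experts: int, d_ff: int):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # The router sees both this layer's input and the output of the
        # shared embedding network.
        self.router = nn.Linear(d_model + d_embed, n_experts)

    def forward(self, x: torch.Tensor, shared_emb: torch.Tensor):
        # x: (frames, d_model), shared_emb: (frames, d_embed)
        gates = F.softmax(self.router(torch.cat([x, shared_emb], dim=-1)), dim=-1)

        # Top-1 routing: each frame is processed by its highest-scoring expert.
        top1 = gates.argmax(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top1 == e
            if sel.any():
                out[sel] = gates[sel, e].unsqueeze(-1) * expert(x[sel])

        # Sparsity L1 loss (assumed form): L1 norm of the unit-normalized gate
        # vector; it is smallest when the routing distribution is one-hot.
        unit = gates / (gates.norm(dim=-1, keepdim=True) + 1e-9)
        l_sparse = unit.sum(dim=-1).mean()

        # Mean importance loss (assumed form): normalized variance of the
        # per-expert mean gate value, encouraging balanced expert usage.
        mean_imp = gates.mean(dim=0)
        l_importance = ((mean_imp - mean_imp.mean()) ** 2).mean() / (mean_imp.mean() ** 2 + 1e-9)

        return out, l_sparse, l_importance
```

During training, the two auxiliary scalars would be added to the recognition loss with small weights, e.g. loss = asr_loss + alpha * l_sparse + beta * l_importance, where alpha and beta are placeholder hyperparameters.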
Related papers
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - Layerwise Recurrent Router for Mixture-of-Experts [42.36093735411238]
Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs.
Current MoE models often display parameter inefficiency.
We introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE)
arXiv Detail & Related papers (2024-08-13T10:25:13Z) - Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [96.21649779507831]
We propose a novel architecture dubbed mixture-of-modules (MoM)
MoM is motivated by an intuition that any layer, regardless of its position, can be used to compute a token.
We show that MoM provides not only a unified framework for Transformers but also a flexible and learnable approach for reducing redundancy.
arXiv Detail & Related papers (2024-07-09T08:50:18Z) - U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF [10.81723269312202]
Mixture-of-Experts (MoE) has been proposed as an energy-efficient path to larger and more capable language models.
We benchmark our proposed model on a large scale inner-source dataset (160k hours)
arXiv Detail & Related papers (2024-04-25T08:34:21Z) - Deformable Mixer Transformer with Gating for Multi-Task Learning of
Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL)
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z) - TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training [18.68993910156101]
We propose TA-MoE, a topology-aware routing strategy for large-scale MoE training.
We show that TA-MoE can substantially outperform its counterparts on various hardware and model configurations.
arXiv Detail & Related papers (2023-02-20T11:18:24Z) - AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for
Efficient Neural Machine Translation [104.0979785739202]
Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks.
Existing MoE models mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network.
We develop AutoMoE -- a framework for designing heterogeneous MoEs under computational constraints.
arXiv Detail & Related papers (2022-10-14T05:32:17Z) - Building a great multi-lingual teacher with sparsely-gated mixture of
experts for speech recognition [13.64861164899787]
Mixture of Experts (MoE) can magnify network capacity with little additional computational complexity.
We apply the sparsely-gated MoE technique to two types of networks: Sequence-to-Sequence Transformer (S2S-T) and Transformer Transducer (T-T)
arXiv Detail & Related papers (2021-12-10T20:37:03Z) - SpeechMoE2: Mixture-of-Experts Model with Improved Routing [29.582683923988203]
We propose a new router architecture which integrates additional global domain and accent embedding into router input to promote adaptability.
Experimental results show that the proposed SpeechMoE2 can achieve lower character error rate (CER) with comparable parameters.
arXiv Detail & Related papers (2021-11-23T12:53:16Z) - MoEfication: Conditional Computation of Transformer Models for Efficient
Inference [66.56994436947441]
Transformer-based pre-trained language models achieve superior performance on most NLP tasks thanks to their large parameter capacity, but this also leads to a huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose transforming a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication (see the sketch after this list).
arXiv Detail & Related papers (2021-10-05T02:14:38Z) - Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
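As noted in the MoEfication entry above, here is a minimal, hypothetical PyTorch sketch of the FFN-splitting idea: the intermediate neurons of an existing dense feed-forward layer are partitioned into equal-size expert groups, and only the top-k groups are kept per token. The contiguous grouping, the activation-mass scoring rule, and the masking-based selection are illustrative assumptions; the paper's own expert construction and routing are more elaborate, and a real implementation would skip the unselected groups instead of computing and masking them.

```python
# Hypothetical sketch of the MoEfication idea. Assumptions: a ReLU FFN,
# contiguous neuron groups as "experts", activation-mass scoring. For clarity
# this computes the full FFN and masks unused groups; a real implementation
# would predict the groups first and skip them to actually save compute.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEfiedFFN(nn.Module):
    def __init__(self, w_in: nn.Linear, w_out: nn.Linear, n_experts: int, top_k: int):
        super().__init__()
        assert w_in.out_features % n_experts == 0
        self.w_in, self.w_out = w_in, w_out            # reuse the dense FFN's weights
        self.n_experts, self.top_k = n_experts, top_k
        self.chunk = w_in.out_features // n_experts    # neurons per expert group

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        h = F.relu(self.w_in(x))                                # (tokens, d_ff)
        h = h.view(x.size(0), self.n_experts, self.chunk)       # split into expert groups
        scores = h.sum(dim=-1)                                  # activation mass per group
        keep = scores.topk(self.top_k, dim=-1).indices          # top-k groups per token
        mask = torch.zeros_like(scores).scatter_(1, keep, 1.0)  # 1 for kept groups
        h = h * mask.unsqueeze(-1)                              # zero the unselected groups
        return self.w_out(h.view(x.size(0), -1))                # (tokens, d_model)
```

Usage would wrap an existing dense FFN's two linear layers, e.g. MoEfiedFFN(nn.Linear(512, 2048), nn.Linear(2048, 512), n_experts=16, top_k=4); the sizes here are placeholders.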
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.