Mixture-of-Expert Conformer for Streaming Multilingual ASR
- URL: http://arxiv.org/abs/2305.15663v1
- Date: Thu, 25 May 2023 02:16:32 GMT
- Title: Mixture-of-Expert Conformer for Streaming Multilingual ASR
- Authors: Ke Hu, Bo Li, Tara N. Sainath, Yu Zhang, Francoise Beaufays
- Abstract summary: We propose a streaming truly multilingual Conformer incorporating mixture-of-expert layers.
The proposed MoE layer offers efficient inference by activating a fixed number of parameters as the number of experts increases.
We evaluate the proposed model on a set of 12 languages, and achieve an average 11.9% relative improvement in WER over the baseline.
- Score: 33.14594179710925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end models with large capacity have significantly improved
multilingual automatic speech recognition, but their computation cost poses
challenges for on-device applications. We propose a streaming truly
multilingual Conformer incorporating mixture-of-expert (MoE) layers that learn
to only activate a subset of parameters in training and inference. The MoE
layer consists of a softmax gate which chooses the best two experts among many
in forward propagation. The proposed MoE layer offers efficient inference by
activating a fixed number of parameters as the number of experts increases. We
evaluate the proposed model on a set of 12 languages, and achieve an average
11.9% relative improvement in WER over the baseline. Compared to an adapter
model using ground truth information, our MoE model achieves similar WER and
activates a similar number of parameters, but without any language information. We
further show around 3% relative WER improvement from multilingual shallow fusion.
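The routing described in the abstract can be pictured with a short, generic sketch. The PyTorch code below shows a top-2 softmax-gated mixture-of-expert feed-forward layer; it is an illustration only, not the authors' implementation, and the class name, layer sizes, and ReLU activation are assumptions made for the example.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Illustrative top-2 softmax-gated mixture-of-experts feed-forward layer."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        # Router: produces one logit per expert for every frame.
        self.gate = nn.Linear(d_model, num_experts)
        # Each expert is an ordinary position-wise feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        probs = F.softmax(self.gate(x), dim=-1)           # (batch, time, num_experts)
        top_p, top_idx = probs.topk(2, dim=-1)            # best two experts per frame
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalize the two gate values
        out = torch.zeros_like(x)
        # Only the two selected experts are evaluated for each frame.
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_idx[..., slot] == e            # frames routed to expert e in this slot
                if mask.any():
                    out[mask] += top_p[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Example usage with illustrative sizes (not the paper's configuration):
layer = Top2MoELayer(d_model=256, d_ff=1024, num_experts=8)
y = layer(torch.randn(2, 50, 256))                        # (batch=2, time=50, d_model=256)
```
Because only the two selected experts run for each frame, the number of activated parameters per frame stays roughly constant no matter how many experts the layer holds, which is the efficiency property the abstract highlights.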
Related papers
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of the parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained SSL and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z)
- Efficient Compression of Multitask Multilingual Speech Models [0.0]
Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2.
DistilWhisper is able to bridge the ASR performance gap for under-represented languages while retaining the advantages of multitask and multilingual capabilities.
arXiv Detail & Related papers (2024-05-02T03:11:59Z)
- On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model [49.81429697921861]
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models.
We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning.
arXiv Detail & Related papers (2023-11-14T00:43:33Z)
- Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity [37.04254056062765]
Stratified Mixture of Experts (SMoE) models can assign dynamic capacity to different tokens.
We show SMoE outperforms multiple state-of-the-art MoE models with the same or fewer parameters.
arXiv Detail & Related papers (2023-05-03T15:18:18Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
- Scaling Up Deliberation for Multilingual ASR [36.860327600638705]
We investigate second-pass deliberation for multilingual speech recognition.
Our proposed deliberation is multilingual, i.e., the text encoder encodes hypothesis text from multiple languages, and the decoder attends to multilingual text and audio.
We show that deliberation improves the average WER on 9 languages by 4% relative compared to the single-pass model.
arXiv Detail & Related papers (2022-10-11T21:07:00Z)
- 3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition [31.992543274210835]
We identify and integrate several approaches to achieve further improvements for ASR tasks.
Specifically, multi-loss refers to the joint CTC/AED loss and multi-path denotes the Mixture-of-Experts (MoE) architecture; a generic sketch of this loss combination appears after this list.
We evaluate our proposed method on the public WenetSpeech dataset, and experimental results show that the proposed method provides a 12.2%-17.6% relative CER improvement.
arXiv Detail & Related papers (2022-04-07T03:10:49Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
- Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs.
arXiv Detail & Related papers (2020-04-24T17:21:32Z)
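The 3M entry above mentions training with a joint CTC/AED loss. The sketch below shows a minimal, generic way to combine the two objectives in PyTorch; it is not taken from that paper, and the function name, the default `ctc_weight`, the blank id, and the tensor shapes are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def joint_ctc_aed_loss(ctc_log_probs, ctc_targets, input_lengths, target_lengths,
                       aed_logits, aed_targets, ctc_weight: float = 0.3):
    """Generic joint CTC/attention (AED) loss: L = w * L_CTC + (1 - w) * L_AED (illustrative)."""
    # CTC branch: ctc_log_probs is (time, batch, vocab) after log_softmax; blank id assumed to be 0.
    l_ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lengths, target_lengths, blank=0)
    # Attention-decoder branch: token-level cross-entropy; aed_logits is (batch, target_len, vocab)
    # and aed_targets is (batch, target_len) padded with -1.
    l_aed = F.cross_entropy(aed_logits.transpose(1, 2), aed_targets, ignore_index=-1)
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_aed
```
The interpolation weight trades off the alignment-friendly CTC objective against the attention decoder's cross-entropy; the actual weighting used in 3M is not stated in this summary.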
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.