SpeechMoE2: Mixture-of-Experts Model with Improved Routing
- URL: http://arxiv.org/abs/2111.11831v1
- Date: Tue, 23 Nov 2021 12:53:16 GMT
- Title: SpeechMoE2: Mixture-of-Experts Model with Improved Routing
- Authors: Zhao You, Shulin Feng, Dan Su and Dong Yu
- Abstract summary: We propose a new router architecture which integrates additional global domain and accent embeddings into the router input to promote adaptability.
Experimental results show that the proposed SpeechMoE2 achieves a lower character error rate (CER) with a comparable number of parameters.
- Score: 29.582683923988203
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Mixture-of-experts based acoustic models with dynamic routing mechanisms have
shown promising results for speech recognition. The design of the router
architecture is important for achieving large model capacity and high computational
efficiency. Our previous work, SpeechMoE, only uses a local grapheme embedding to
help the router make routing decisions. To further improve speech recognition
performance across varying domains and accents, we propose a new router
architecture that integrates additional global domain and accent embeddings
into the router input to promote adaptability. Experimental results show that the
proposed SpeechMoE2 achieves a lower character error rate (CER) than SpeechMoE
with a comparable number of parameters on both multi-domain and multi-accent
tasks. Specifically, the proposed method provides a 1.6%-4.8% relative CER
improvement on the multi-domain task and a 1.9%-17.7% relative CER improvement
on the multi-accent task. Moreover, increasing the number of experts yields
consistent performance improvements while keeping the computational cost constant.
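As a rough illustration of the routing idea described in the abstract, here is a minimal PyTorch sketch (not the authors' code; all module names, dimensions, and the plain top-1 gate are illustrative assumptions) of a router whose input concatenates each frame's feature with global domain and accent embeddings, so that every frame is dispatched to exactly one expert and per-frame compute stays constant as experts are added:

```python
import torch
import torch.nn as nn


class EmbeddingAwareRouter(nn.Module):
    """Router input = per-frame feature + global domain/accent embeddings."""

    def __init__(self, frame_dim, domain_dim, accent_dim, num_experts):
        super().__init__()
        self.gate = nn.Linear(frame_dim + domain_dim + accent_dim, num_experts)

    def forward(self, frames, domain_emb, accent_emb):
        # frames: (T, frame_dim); the embeddings are utterance-level vectors
        t = frames.size(0)
        global_ctx = torch.cat([domain_emb, accent_emb]).expand(t, -1)
        return self.gate(torch.cat([frames, global_ctx], dim=-1))  # (T, num_experts)


class MoELayer(nn.Module):
    def __init__(self, frame_dim, domain_dim, accent_dim, num_experts, hidden_dim=1024):
        super().__init__()
        self.router = EmbeddingAwareRouter(frame_dim, domain_dim, accent_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(frame_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, frame_dim))
            for _ in range(num_experts)
        )

    def forward(self, frames, domain_emb, accent_emb):
        logits = self.router(frames, domain_emb, accent_emb)
        expert_ids = logits.argmax(dim=-1)  # top-1: one expert per frame
        out = torch.zeros_like(frames)
        for e, expert in enumerate(self.experts):
            mask = expert_ids == e
            if mask.any():
                # Each frame visits exactly one expert, so per-frame compute
                # does not grow with num_experts.
                out[mask] = expert(frames[mask])
        return out


# Toy usage: 100 frames of 80-dim features, 16-dim domain/accent embeddings
layer = MoELayer(frame_dim=80, domain_dim=16, accent_dim=16, num_experts=4)
y = layer(torch.randn(100, 80), torch.randn(16), torch.randn(16))
```

A real system would additionally train the gate with softmax probabilities and sparsity/load-balancing losses, which this sketch omits.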
Related papers
- Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition [1.0690007351232649]
We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent.
Experimental results demonstrate that our proposed methods outperform the baseline with relative reductions of 22.1% and 17.2% in character error rate (CER) across multi-accent test datasets.
arXiv Detail & Related papers (2024-07-03T11:35:52Z)
- Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning [50.73666458313015]
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications.
MoE has emerged as a promising solution, with its sparse architecture enabling effective task decoupling.
Intuition-MoR1E achieves superior efficiency and a 2.15% overall accuracy improvement across 14 public datasets.
arXiv Detail & Related papers (2024-04-13T12:14:58Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- ROSE: A Recognition-Oriented Speech Enhancement Framework in Air Traffic Control Using Multi-Objective Learning [6.60571587618006]
Radio speech echo is a specific phenomenon in the air traffic control (ATC) domain, which degrades speech quality and impacts automatic speech recognition (ASR) accuracy.
In this work, a time-domain recognition-oriented speech enhancement framework is proposed to improve speech intelligibility and advance ASR accuracy.
The framework serves as a plug-and-play tool in ATC scenarios and does not require additional retraining of the ASR model.
arXiv Detail & Related papers (2023-12-11T04:51:41Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable enhancement of the pre-trained features, as sketched below.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
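As a hedged sketch of the reprogramming idea (assumptions only; the paper's actual auxiliary architectures differ), the pre-trained model is kept frozen and only a small input-level module is trained for the new language:

```python
import torch
import torch.nn as nn


class ReprogrammedASR(nn.Module):
    """Frozen pre-trained encoder plus a small trainable input transformation."""

    def __init__(self, pretrained_encoder: nn.Module, feat_dim: int):
        super().__init__()
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # the pre-trained model stays untouched
        # Lightweight learnable "feature enhancement" applied before the encoder
        self.reprogram = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
        )

    def forward(self, feats):
        # Only self.reprogram receives gradient updates during fine-tuning
        return self.encoder(feats + self.reprogram(feats))
```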
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Enhancing and Adversarial: Improve ASR with Speaker Labels [49.73714831258699]
We propose a novel adaptive gradient reversal layer for stable and effective adversarial training without tuning effort (a generic version is sketched below).
Detailed analysis and experimental verification are conducted to show the optimal positions in the ASR neural network (NN) to apply speaker enhancing and adversarial training.
Our best speaker-based MTL achieves 7% relative improvement on the Switchboard Hub5'00 set.
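For reference, a generic gradient reversal layer in PyTorch looks like the sketch below; the paper's contribution is an adaptive version of this idea, which the fixed alpha here does not reproduce:

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign of the gradient flowing back into the shared encoder;
        # alpha controls its magnitude (fixed here, adaptive in the paper).
        return -ctx.alpha * grad_output, None


def grad_reverse(x, alpha=1.0):
    return GradReverse.apply(x, alpha)


# Typical wiring: the speaker classifier sits behind the reversal layer, so
# minimizing its loss pushes the encoder toward speaker-invariant features.
# speaker_logits = speaker_classifier(grad_reverse(encoder_output, alpha=0.5))
```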
arXiv Detail & Related papers (2022-11-11T17:40:08Z)
- Multi-turn RNN-T for streaming recognition of multi-party speech [2.899379040028688]
This work takes real-time applicability as the first priority in model design and addresses several challenges from previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T).
We introduce on-the-fly overlapping speech simulation during training (sketched below), yielding a 14% relative word error rate (WER) improvement on the LibriSpeechMix test set.
We propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture.
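A minimal sketch of what on-the-fly overlap simulation can look like (illustrative assumptions, not the paper's exact procedure): a second waveform is mixed in at a random offset and gain each time a training example is drawn:

```python
import numpy as np


def simulate_overlap(utt_a, utt_b, rng):
    """Mix utterance B into utterance A at a random offset and gain."""
    offset = int(rng.integers(0, len(utt_a)))  # where speaker B starts
    gain = rng.uniform(0.5, 1.0)               # relative level of speaker B
    mixed = np.zeros(max(len(utt_a), offset + len(utt_b)), dtype=np.float32)
    mixed[: len(utt_a)] += utt_a
    mixed[offset : offset + len(utt_b)] += gain * utt_b
    return mixed


# Example: overlap two random 1-second "waveforms" at 16 kHz
rng = np.random.default_rng(0)
a = rng.standard_normal(16000).astype(np.float32)
b = rng.standard_normal(16000).astype(np.float32)
mix = simulate_overlap(a, b, rng)
```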
arXiv Detail & Related papers (2021-12-19T17:22:58Z)
- SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts [29.582683923988203]
Mixture-of-Experts (MoE) based Transformers have shown promising results in many domains.
In this work, we explore the MoE based model for speech recognition, named SpeechMoE.
SpeechMoE uses a new router architecture that can simultaneously utilize information from a shared embedding network.
arXiv Detail & Related papers (2021-05-07T02:38:23Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environments is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)