Building a great multi-lingual teacher with sparsely-gated mixture of
experts for speech recognition
- URL: http://arxiv.org/abs/2112.05820v1
- Date: Fri, 10 Dec 2021 20:37:03 GMT
- Title: Building a great multi-lingual teacher with sparsely-gated mixture of
experts for speech recognition
- Authors: Kenichi Kumatani, Robert Gmyr, Felipe Cruz Salinas, Linquan Liu, Wei
Zuo, Devang Patel, Eric Sun and Yu Shi
- Abstract summary: Mixture of Experts (MoE) can magnify network capacity with little additional computational complexity.
We apply the sparsely-gated MoE technique to two types of networks: Sequence-to-Sequence Transformer (S2S-T) and Transformer Transducer (T-T).
- Score: 13.64861164899787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The sparsely-gated Mixture of Experts (MoE) can magnify network capacity
with little additional computational complexity. In this work, we investigate how
multi-lingual Automatic Speech Recognition (ASR) networks can be scaled up with
a simple routing algorithm in order to achieve better accuracy. More
specifically, we apply the sparsely-gated MoE technique to two types of
networks: Sequence-to-Sequence Transformer (S2S-T) and Transformer Transducer
(T-T). We demonstrate through a set of ASR experiments on multiple language
data that the MoE networks can reduce the relative word error rates by 16.5%
and 4.7% with the S2S-T and T-T, respectively. Moreover, we thoroughly
investigate the effect of the MoE on the T-T architecture in various
conditions: streaming mode, non-streaming mode, the use of language ID and the
label decoder with the MoE.
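To make the routing idea concrete, below is a minimal PyTorch sketch of a sparsely-gated MoE feed-forward layer with top-1 routing, of the kind that can replace the dense feed-forward block inside an S2S-T or T-T layer. It is an illustrative assumption-laden example, not the authors' implementation; the expert count, layer sizes, and top-1 gating are chosen only for the sketch.

```python
# Illustrative sketch only: a sparsely-gated MoE feed-forward layer with
# top-1 routing, usable in place of the dense feed-forward block of a
# Transformer (S2S-T or T-T) layer. Expert count, dimensions, and top-1
# gating are assumptions for this example, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFeedForward(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048, num_experts: int = 8):
        super().__init__()
        # Router scores each expert for every input frame.
        self.router = nn.Linear(d_model, num_experts)
        # Each expert is an independent position-wise feed-forward network.
        self.experts = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model) -> flatten frames for routing.
        frames = x.reshape(-1, x.size(-1))
        gate_probs = F.softmax(self.router(frames), dim=-1)   # (frames, experts)
        top_prob, top_idx = gate_probs.max(dim=-1)            # top-1 expert per frame
        out = torch.zeros_like(frames)
        # Only the routed expert runs for each frame, so compute stays close to
        # the dense baseline even as the number of experts (parameters) grows.
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(frames[mask])
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = SparseMoEFeedForward()
    speech = torch.randn(2, 100, 512)   # (batch, frames, d_model)
    print(layer(speech).shape)          # torch.Size([2, 100, 512])
```

Because only the selected expert runs per frame, adding experts grows the parameter count much faster than the per-frame compute, which is the property that makes MoE attractive for scaling multi-lingual ASR.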
Related papers
- Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper [3.717584661565119]
We demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch without supervised data.
This allows training a robust ASR model in just one stage, without requiring a large data and computational budget.
We validate the proposed framework on 6 languages from CommonVoice and propose multiple filters to filter out hallucinated PLs.
arXiv Detail & Related papers (2024-09-20T13:38:59Z)
- Mechanistic Interpretability of Binary and Ternary Transformers [1.3715396507106912]
We investigate whether binary and ternary transformer networks learn distinctly different or similar algorithms when compared to full-precision transformer networks.
This provides evidence against the possibility of using binary and ternary networks as a more interpretable alternative in the Large Language Models setting.
arXiv Detail & Related papers (2024-05-27T23:22:23Z)
- U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF [10.81723269312202]
Mixture-of-Experts (MoE) has been proposed as an energy-efficient path to larger and more capable language models.
We benchmark our proposed model on a large-scale inner-source dataset (160k hours).
arXiv Detail & Related papers (2024-04-25T08:34:21Z)
- Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model that combines the merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
- DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction [40.447092963041236]
We present a novel MTL model that combines the merits of deformable CNN and query-based Transformer.
Our method, named DeMT, is based on a simple and effective encoder-decoder architecture.
Our model uses fewer GFLOPs and significantly outperforms current Transformer- and CNN-based competitive models.
arXiv Detail & Related papers (2023-01-09T16:00:15Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS [52.51848317549301]
We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis.
A vector-quantized variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of the speech training data.
In synthesis, the neural vocoder converts the predicted MSMCRs into final speech waveforms.
arXiv Detail & Related papers (2022-09-22T09:43:17Z)
- SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts [29.582683923988203]
Mixture of Experts (MoE) based Transformer has shown promising results in many domains.
In this work, we explore the MoE based model for speech recognition, named SpeechMoE.
A new router architecture is used in SpeechMoE, which can simultaneously utilize the information from a shared embedding network.
arXiv Detail & Related papers (2021-05-07T02:38:23Z)
- Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation [71.54816893482457]
We introduce the dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST).
Our models are based on the original Transformer architecture but consist of two decoders, each responsible for one task (ASR or ST).
arXiv Detail & Related papers (2020-11-02T04:59:50Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- MetricUNet: Synergistic Image- and Voxel-Level Learning for Precise CT Prostate Segmentation via Online Sampling [66.01558025094333]
We propose a two-stage framework, with the first stage to quickly localize the prostate region and the second stage to precisely segment the prostate.
We introduce a novel online metric learning module through voxel-wise sampling in the multi-task network.
Our method can effectively learn more representative voxel-level features compared with the conventional learning methods with cross-entropy or Dice loss.
arXiv Detail & Related papers (2020-05-15T10:37:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.