A Mixture of Expert Based Deep Neural Network for Improved ASR
- URL: http://arxiv.org/abs/2112.01025v1
- Date: Thu, 2 Dec 2021 07:26:34 GMT
- Title: A Mixture of Expert Based Deep Neural Network for Improved ASR
- Authors: Vishwanath Pratap Singh, Shakti P. Rath, Abhishek Pandey
- Abstract summary: MixNet is a novel deep learning architecture for acoustic modeling in the context of Automatic Speech Recognition (ASR).
In natural speech, overlap in distribution across different acoustic classes is inevitable, which leads to inter-class mis-classification.
Experiments conducted on a large vocabulary ASR task show that the proposed architecture provides 13.6% and 10.0% relative reductions in word error rate over conventional DNN and LSTM models, respectively.
- Score: 4.993304210475779
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents a novel deep learning architecture for acoustic modeling in
the context of Automatic Speech Recognition (ASR), termed MixNet. Besides
the conventional layers, such as fully connected layers in DNN-HMM and memory
cells in LSTM-HMM, the model uses two additional layers based on Mixture of
Experts (MoE). The first MoE layer, operating at the input, is based on
pre-defined broad phonetic classes, and the second, operating at the
penultimate layer, is based on automatically learned acoustic classes. In
natural speech, overlap in distribution across different acoustic classes is
inevitable, which leads to inter-class misclassification. ASR accuracy is
expected to improve if the conventional acoustic-model architecture is
modified to better account for such overlaps. MixNet is developed with this
in mind. Analysis conducted by means of scatter diagrams
verifies that MoE indeed improves the separation between classes that
translates to better ASR accuracy. Experiments are conducted on a large
vocabulary ASR task which show that the proposed architecture provides 13.6%
and 10.0% relative reduction in word error rates compared to the conventional
models, namely DNN and LSTM respectively, trained using the sMBR criterion. In
comparison to an existing method developed for phone classification (by Eigen
et al.), our proposed method yields a significant improvement.
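As a rough illustration of the MoE idea described in the abstract, the following is a minimal sketch of a mixture-of-experts layer: a softmax gate, computed from the input, weights the outputs of several experts. The linear experts, random initialization, and all names here are illustrative assumptions, not details from the paper (which builds its gates on phonetic and learned acoustic classes).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoELayer:
    """Minimal mixture-of-experts layer (illustrative): a softmax gate
    computed from the input weights the outputs of linear experts."""
    def __init__(self, dim_in, dim_out, n_experts, seed=0):
        rng = np.random.default_rng(seed)
        # One linear expert per mixture component, plus a gating matrix.
        self.W_experts = 0.1 * rng.standard_normal((n_experts, dim_in, dim_out))
        self.W_gate = 0.1 * rng.standard_normal((dim_in, n_experts))

    def forward(self, x):
        # x: (batch, dim_in)
        gates = softmax(x @ self.W_gate)                          # (batch, n_experts)
        expert_out = np.einsum('bi,eio->beo', x, self.W_experts)  # (batch, n_experts, dim_out)
        return np.einsum('be,beo->bo', gates, expert_out)         # (batch, dim_out)
```

In MixNet, two such layers are used: one at the input (gated by broad phonetic classes) and one at the penultimate layer (gated by automatically learned acoustic classes).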
Related papers
- MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization [49.00754561435518]
MSRS achieves competitive results in VSR and AVSR with 21.1% and 0.9% WER on the LRS3 benchmark, while reducing training time by at least 2x.
We explore other sparse approaches and show that only MSRS enables training from scratch by implicitly masking the weights affected by vanishing gradients.
arXiv Detail & Related papers (2024-06-25T15:00:43Z)
- Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
- Model-based Deep Learning Receiver Design for Rate-Splitting Multiple Access [65.21117658030235]
This work proposes a novel design for a practical RSMA receiver based on model-based deep learning (MBDL) methods.
The MBDL receiver is evaluated in terms of uncoded Symbol Error Rate (SER), throughput performance through Link-Level Simulations (LLS) and average training overhead.
Results reveal that the MBDL outperforms by a significant margin the SIC receiver with imperfect CSIR.
arXiv Detail & Related papers (2022-05-02T12:23:55Z)
- ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
arXiv Detail & Related papers (2022-02-02T21:17:14Z)
- Conformer-based Hybrid ASR System for Switchboard Dataset [99.88988282353206]
We present and evaluate a competitive Conformer-based hybrid model training recipe.
We study different training aspects and methods to improve word error rate as well as to increase training speed.
We conduct experiments on the 300-hour Switchboard dataset, and our Conformer-based hybrid model achieves competitive results.
arXiv Detail & Related papers (2021-11-05T12:03:18Z)
- Improving Character Error Rate Is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-box Acoustic Models [1.6328866317851185]
A deep neural network (DNN)-based speech enhancement (SE) is proposed in this paper.
Our method uses two DNNs: one for speech processing and one for mimicking the output CERs derived through an acoustic model (AM).
Experimental results show that our method improved the CER derived through a black-box AM by 7.3% relative, even though certain noise levels are retained.
arXiv Detail & Related papers (2021-10-12T12:51:53Z)
- Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization [21.216783537997426]
We propose an architecture that is able to better leverage the acoustic features provided by PANNs for the Automated Audio Captioning Task.
We also introduce a novel self-supervised objective, Reconstruction Latent Space Similarity Regularization (RLSSR).
arXiv Detail & Related papers (2021-08-10T13:49:41Z)
- Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers [54.47911829539919]
We develop a novel top-down training method which can be viewed as an algorithm for searching for high-quality classifiers.
We tested this method on automatic speech recognition (ASR) tasks and language modelling tasks.
The proposed method consistently improves recurrent neural network ASR models on Wall Street Journal, self-attention ASR models on Switchboard, and AWD-LSTM language models on WikiText-2.
arXiv Detail & Related papers (2021-02-09T08:19:49Z)
- Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs).
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
arXiv Detail & Related papers (2020-07-17T08:32:11Z)
- High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model [46.34788932277904]
We improve conventional hybrid LSTM acoustic models for high-accuracy and low-latency automatic speech recognition.
To achieve high accuracy, we use a contextual layer trajectory LSTM (cltLSTM), which decouples the temporal modeling and target classification tasks.
We further improve the training strategy with sequence-level teacher-student learning.
arXiv Detail & Related papers (2020-03-17T00:52:11Z)
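The teacher-student training mentioned in the last entry can be illustrated with a sketch. Note this is a hedged simplification: the paper uses sequence-level teacher-student learning, while the snippet below shows the more common frame-level form, where the student is trained to match the teacher's per-frame posterior distribution via a KL-divergence loss. All function names are illustrative.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    m = x.max(axis=axis, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def teacher_student_loss(student_logits, teacher_logits):
    """Frame-level KL(teacher || student), averaged over frames.
    A simplification of the sequence-level variant used in the paper."""
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    p_t = np.exp(log_p_t)
    return float((p_t * (log_p_t - log_p_s)).sum(axis=-1).mean())
```

The loss is zero when the student posteriors match the teacher's exactly and positive otherwise, so minimizing it pushes the student toward the teacher's soft targets.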
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.