A Mixture of Expert Based Deep Neural Network for Improved ASR
- URL: http://arxiv.org/abs/2112.01025v1
- Date: Thu, 2 Dec 2021 07:26:34 GMT
- Title: A Mixture of Expert Based Deep Neural Network for Improved ASR
- Authors: Vishwanath Pratap Singh, Shakti P. Rath, Abhishek Pandey
- Abstract summary: MixNet is a novel deep learning architecture for the acoustic model in Automatic Speech Recognition (ASR).
In natural speech, overlap in distribution across different acoustic classes is inevitable and leads to inter-class misclassification.
Experiments on a large-vocabulary ASR task show that the proposed architecture provides 13.6% and 10.0% relative reductions in word error rate over conventional DNN and LSTM models, respectively.
- Score: 4.993304210475779
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents a novel deep learning architecture for the acoustic
model in Automatic Speech Recognition (ASR), termed MixNet. Besides the
conventional layers, such as fully connected layers in DNN-HMM and memory
cells in LSTM-HMM, the model uses two additional layers based on Mixture of
Experts (MoE). The first MoE layer, operating at the input, is based on
pre-defined broad phonetic classes, and the second, operating at the
penultimate layer, is based on automatically learned acoustic classes. In
natural speech, overlap in distribution across different acoustic classes is
inevitable and leads to inter-class misclassification. ASR accuracy is
expected to improve if the conventional acoustic-model architecture is
modified to better account for such overlaps, and MixNet is designed with
this in mind. Analysis conducted by means of scatter diagrams verifies that
the MoE layers indeed improve the separation between classes, which
translates into better ASR accuracy. Experiments conducted on a
large-vocabulary ASR task show that the proposed architecture provides 13.6%
and 10.0% relative reductions in word error rate compared to conventional
DNN and LSTM models, respectively, trained using the sMBR criterion. In
comparison to an existing method developed for phone classification (by
Eigen et al.), the proposed method yields a significant improvement.
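
The paper itself does not include code, but the architecture described above can be pictured with a minimal PyTorch sketch: one soft mixture-of-experts layer at the input (gated, in the paper, by pre-defined broad phonetic classes) and another at the penultimate layer (gated by automatically learned acoustic classes). All class names, layer sizes, and the learned stand-in gates below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of an acoustic model with two
# Mixture-of-Experts layers: one gated at the input, one at the
# penultimate layer, as described in the abstract above.
import torch
import torch.nn as nn


class MoELayer(nn.Module):
    """Soft mixture of linear experts: output = sum_k g_k(x) * E_k(x)."""

    def __init__(self, in_dim, out_dim, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)                   # (B, K)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, D, K)
        return (expert_out * weights.unsqueeze(1)).sum(dim=-1)          # (B, D)


class MixNetLikeAM(nn.Module):
    """DNN-HMM style acoustic model with MoE at input and penultimate layers."""

    def __init__(self, feat_dim=40, hidden=1024, num_broad_classes=5,
                 num_acoustic_classes=8, num_senones=3000):
        super().__init__()
        # First MoE layer: in the paper the gate follows pre-defined broad
        # phonetic classes; here a learned gate on the features is a stand-in.
        self.input_moe = MoELayer(feat_dim, hidden, num_broad_classes)
        self.trunk = nn.Sequential(
            nn.ReLU(), nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Second MoE layer at the penultimate position, corresponding to
        # the automatically learned acoustic classes.
        self.penultimate_moe = MoELayer(hidden, hidden, num_acoustic_classes)
        self.output = nn.Linear(hidden, num_senones)  # senone logits

    def forward(self, feats):                  # feats: (batch, feat_dim)
        h = self.input_moe(feats)
        h = self.trunk(h)
        h = torch.relu(self.penultimate_moe(h))
        return self.output(h)


if __name__ == "__main__":
    model = MixNetLikeAM()
    print(model(torch.randn(8, 40)).shape)     # torch.Size([8, 3000])
```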
Related papers
- Towards Foundational Models for Dynamical System Reconstruction: Hierarchical Meta-Learning via Mixture of Experts [0.7373617024876724]
We introduce MixER: Mixture of Expert Reconstructors, a novel sparse top-1 MoE layer employing a custom gating update algorithm based on $K$-means and least squares.
Experiments validate MixER's capabilities, demonstrating efficient training and scalability to systems of up to ten parametric ordinary differential equations.
Our layer underperforms state-of-the-art meta-learners in high-data regimes, particularly when each expert is constrained to process only a fraction of a dataset composed of highly related data points.
arXiv Detail & Related papers (2025-02-07T21:16:43Z)
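
As a rough illustration of the sparse top-1 routing described in the MixER entry above (not the MixER code: the centroid-based gate and linear experts below are assumptions), each input can be routed to the single expert whose gating centroid is nearest, in the spirit of a K-means assignment:

```python
# Illustrative sparse top-1 MoE routing with a K-means-style gate
# (an assumption-laden sketch, not the MixER implementation).
import torch
import torch.nn as nn


class Top1KMeansMoE(nn.Module):
    def __init__(self, dim, num_experts):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_experts, dim))
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_experts)]
        )

    def forward(self, x):                       # x: (batch, dim)
        # Hard top-1 assignment: nearest centroid per input.
        dists = torch.cdist(x, self.centroids)  # (batch, num_experts)
        assign = dists.argmin(dim=-1)           # (batch,)
        out = torch.zeros_like(x)
        for k, expert in enumerate(self.experts):
            mask = assign == k
            if mask.any():
                out[mask] = expert(x[mask])     # only the chosen expert runs
        return out, assign


moe = Top1KMeansMoE(dim=16, num_experts=4)
y, routes = moe(torch.randn(32, 16))
print(y.shape, routes.shape)                    # torch.Size([32, 16]) torch.Size([32])
```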
- Adaptive Cyber-Attack Detection in IIoT Using Attention-Based LSTM-CNN Models [0.23408308015481666]
This study presents the development and evaluation of an advanced Intrusion Detection System (IDS) based on a hybrid LSTM-Convolutional Neural Network (CNN)-Attention architecture.
The research focuses on two key classification tasks: binary and multi-class classification.
In binary classification, the model achieved near-perfect accuracy, while in multi-class classification, it maintained a high accuracy level (99.04%), effectively categorizing different attack types with a loss value of 0.0220%.
arXiv Detail & Related papers (2025-01-21T20:52:23Z)
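
A hedged sketch of the kind of hybrid CNN + LSTM + attention classifier named in the intrusion-detection entry above; the layer sizes, ordering, and pooling are assumptions rather than the paper's configuration.

```python
# Rough sketch of a hybrid CNN + LSTM + attention classifier
# (illustrative only; not the paper's architecture or hyper-parameters).
import torch
import torch.nn as nn


class CnnLstmAttnClassifier(nn.Module):
    def __init__(self, feat_dim=64, num_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(128, 128, batch_first=True)
        self.attn = nn.Linear(128, 1)          # attention scores over time
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)   # (B, T, 128)
        h, _ = self.lstm(h)                                # (B, T, 128)
        w = torch.softmax(self.attn(h), dim=1)             # (B, T, 1)
        pooled = (w * h).sum(dim=1)                        # attention pooling
        return self.head(pooled)                           # class logits


model = CnnLstmAttnClassifier()
print(model(torch.randn(4, 100, 64)).shape)    # torch.Size([4, 10])
```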
- MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization [49.00754561435518]
MSRS achieves competitive results in VSR and AVSR with 21.1% and 0.9% WER on the LRS3 benchmark, while reducing training time by at least 2x.
We explore other sparse approaches and show that only MSRS enables training from scratch by implicitly masking the weights affected by vanishing gradients.
arXiv Detail & Related papers (2024-06-25T15:00:43Z)
- Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
- Model-based Deep Learning Receiver Design for Rate-Splitting Multiple Access [65.21117658030235]
This work proposes a novel design for a practical RSMA receiver based on model-based deep learning (MBDL) methods.
The MBDL receiver is evaluated in terms of uncoded Symbol Error Rate (SER), throughput performance through Link-Level Simulations (LLS) and average training overhead.
Results reveal that the MBDL receiver outperforms the SIC receiver with imperfect CSIR by a significant margin.
arXiv Detail & Related papers (2022-05-02T12:23:55Z)
- Conformer-based Hybrid ASR System for Switchboard Dataset [99.88988282353206]
We present and evaluate a competitive conformer-based hybrid model training recipe.
We study different training aspects and methods to improve the word error rate as well as to increase training speed.
We conduct experiments on the 300-hour Switchboard dataset, and our conformer-based hybrid model achieves competitive results.
arXiv Detail & Related papers (2021-11-05T12:03:18Z)
- Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization [21.216783537997426]
We propose an architecture that is able to better leverage the acoustic features provided by PANNs for the Automated Audio Captioning Task.
We also introduce a novel self-supervised objective, Reconstruction Latent Space Similarity Regularization (RLSSR).
arXiv Detail & Related papers (2021-08-10T13:49:41Z)
- Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers [54.47911829539919]
We develop a novel top-down training method which can be viewed as an algorithm for searching for high-quality classifiers.
We tested this method on automatic speech recognition (ASR) tasks and language modelling tasks.
The proposed method consistently improves recurrent neural network ASR models on Wall Street Journal, self-attention ASR models on Switchboard, and AWD-LSTM language models on WikiText-2.
arXiv Detail & Related papers (2021-02-09T08:19:49Z)
- Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs).
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
arXiv Detail & Related papers (2020-07-17T08:32:11Z)
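
The DARTS-style selection mentioned in the NAS entry above amounts to a softmax-weighted mixture over candidate operations whose mixing weights are trained jointly with the network and later discretized; the sketch below is a generic illustration with assumed candidate operations, not the paper's factored TDNN search space.

```python
# Generic sketch of DARTS-style differentiable architecture selection
# (illustrative only; candidate ops and dimensions are assumptions).
import torch
import torch.nn as nn


class DartsMixedOp(nn.Module):
    """Softmax-weighted mixture of candidate ops; keep the argmax after search."""

    def __init__(self, dim):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.Identity(),                       # skip connection
            nn.Linear(dim, dim),                 # single linear transform
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)),
        ])
        # Architecture parameters (alpha), trained alongside the weights.
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)     # selection probabilities
        return sum(wi * op(x) for wi, op in zip(w, self.candidates))

    def discretize(self):
        """Return the single op chosen at the end of the search."""
        return self.candidates[int(self.alpha.argmax())]


layer = DartsMixedOp(dim=64)
print(layer(torch.randn(10, 64)).shape, layer.discretize())
```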
- High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model [46.34788932277904]
We improve conventional hybrid LSTM acoustic models for high-accuracy and low-latency automatic speech recognition.
To achieve high accuracy, we use a contextual layer trajectory LSTM (cltLSTM), which decouples the temporal modeling and target classification tasks.
We further improve the training strategy with sequence-level teacher-student learning.
arXiv Detail & Related papers (2020-03-17T00:52:11Z)
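
As a simplified illustration of the teacher-student idea in the last entry (reduced here to frame-level KL distillation; the paper's method is sequence-level, and all names below are assumptions):

```python
# Simplified frame-level teacher-student distillation for a hybrid
# acoustic model (a sketch; the paper uses sequence-level teacher-student
# learning operating on sequence/lattice posteriors).
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student senone posteriors."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # "batchmean" gives the mean KL per frame in the batch.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")


# Toy usage: (batch * frames, num_senones) logits from both models.
student = torch.randn(160, 3000, requires_grad=True)
teacher = torch.randn(160, 3000)
loss = distillation_loss(student, teacher)
loss.backward()
print(float(loss))
```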