A Mixture of Expert Based Deep Neural Network for Improved ASR
- URL: http://arxiv.org/abs/2112.01025v1
- Date: Thu, 2 Dec 2021 07:26:34 GMT
- Title: A Mixture of Expert Based Deep Neural Network for Improved ASR
- Authors: Vishwanath Pratap Singh, Shakti P. Rath, Abhishek Pandey
- Abstract summary: MixNet is a novel deep learning architecture for the acoustic model in Automatic Speech Recognition (ASR).
In natural speech, overlap in distribution across different acoustic classes is inevitable and leads to inter-class misclassification.
Experiments on a large-vocabulary ASR task show that the proposed architecture provides 13.6% and 10.0% relative reductions in word error rate over conventional DNN and LSTM models, respectively.
- Score: 4.993304210475779
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents a novel deep learning architecture for the acoustic
model in Automatic Speech Recognition (ASR), termed MixNet. Besides the
conventional layers, such as fully connected layers in DNN-HMM and memory
cells in LSTM-HMM, the model uses two additional layers based on Mixture of
Experts (MoE). The first MoE layer, operating at the input, is based on
pre-defined broad phonetic classes, and the second, operating at the
penultimate layer, is based on automatically learned acoustic classes. In
natural speech, overlap in distribution across different acoustic classes is
inevitable and leads to inter-class misclassification. ASR accuracy is
expected to improve if the conventional acoustic-model architecture is
modified to better account for such overlaps, and MixNet is designed with
this in mind. Analysis conducted by means of scatter diagrams verifies that
the MoE layers indeed improve the separation between classes, which
translates into better ASR accuracy. Experiments conducted on a
large-vocabulary ASR task show that the proposed architecture provides 13.6%
and 10.0% relative reductions in word error rate compared to conventional
DNN and LSTM models, respectively, trained using the sMBR criterion. In
comparison to an existing method developed for phone classification (by
Eigen et al.), the proposed method yields a significant improvement.
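
The paper itself does not include code, but the architecture described above can be pictured with a minimal PyTorch sketch: one soft mixture-of-experts layer at the input (gated, in the paper, by pre-defined broad phonetic classes) and another at the penultimate layer (gated by automatically learned acoustic classes). All class names, layer sizes, and the learned stand-in gates below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of an acoustic model with two
# Mixture-of-Experts layers: one gated at the input, one at the
# penultimate layer, as described in the abstract above.
import torch
import torch.nn as nn


class MoELayer(nn.Module):
    """Soft mixture of linear experts: output = sum_k g_k(x) * E_k(x)."""

    def __init__(self, in_dim, out_dim, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)                   # (B, K)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, D, K)
        return (expert_out * weights.unsqueeze(1)).sum(dim=-1)          # (B, D)


class MixNetLikeAM(nn.Module):
    """DNN-HMM style acoustic model with MoE at input and penultimate layers."""

    def __init__(self, feat_dim=40, hidden=1024, num_broad_classes=5,
                 num_acoustic_classes=8, num_senones=3000):
        super().__init__()
        # First MoE layer: in the paper the gate follows pre-defined broad
        # phonetic classes; here a learned gate on the features is a stand-in.
        self.input_moe = MoELayer(feat_dim, hidden, num_broad_classes)
        self.trunk = nn.Sequential(
            nn.ReLU(), nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Second MoE layer at the penultimate position, corresponding to
        # the automatically learned acoustic classes.
        self.penultimate_moe = MoELayer(hidden, hidden, num_acoustic_classes)
        self.output = nn.Linear(hidden, num_senones)  # senone logits

    def forward(self, feats):                  # feats: (batch, feat_dim)
        h = self.input_moe(feats)
        h = self.trunk(h)
        h = torch.relu(self.penultimate_moe(h))
        return self.output(h)


if __name__ == "__main__":
    model = MixNetLikeAM()
    print(model(torch.randn(8, 40)).shape)     # torch.Size([8, 3000])
```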
Related papers
- Towards Foundational Models for Dynamical System Reconstruction: Hierarchical Meta-Learning via Mixture of Experts [0.7373617024876724]
We introduce MixER: Mixture of Expert Reconstructors, a novel sparse top-1 MoE layer employing a custom gating update algorithm based on $K$-means and least squares.
Experiments validate MixER's capabilities, demonstrating efficient training and scalability to systems of up to ten parametric ordinary differential equations.
Our layer underperforms state-of-the-art meta-learners in high-data regimes, particularly when each expert is constrained to process only a fraction of a dataset composed of highly related data points.
arXiv Detail & Related papers (2025-02-07T21:16:43Z)
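
As a rough illustration of the sparse top-1 routing described in the MixER entry above (not the MixER code: the centroid-based gate and linear experts below are assumptions), each input can be routed to the single expert whose gating centroid is nearest, in the spirit of a K-means assignment:

```python
# Illustrative sparse top-1 MoE routing with a K-means-style gate
# (an assumption-laden sketch, not the MixER implementation).
import torch
import torch.nn as nn


class Top1KMeansMoE(nn.Module):
    def __init__(self, dim, num_experts):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_experts, dim))
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_experts)]
        )

    def forward(self, x):                       # x: (batch, dim)
        # Hard top-1 assignment: nearest centroid per input.
        dists = torch.cdist(x, self.centroids)  # (batch, num_experts)
        assign = dists.argmin(dim=-1)           # (batch,)
        out = torch.zeros_like(x)
        for k, expert in enumerate(self.experts):
            mask = assign == k
            if mask.any():
                out[mask] = expert(x[mask])     # only the chosen expert runs
        return out, assign


moe = Top1KMeansMoE(dim=16, num_experts=4)
y, routes = moe(torch.randn(32, 16))
print(y.shape, routes.shape)                    # torch.Size([32, 16]) torch.Size([32])
```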
- Adaptive Cyber-Attack Detection in IIoT Using Attention-Based LSTM-CNN Models [0.23408308015481666]
This study presents the development and evaluation of an advanced Intrusion Detection System (IDS) based on a hybrid LSTM-Convolutional Neural Network (CNN)-Attention architecture.
The research focuses on two key classification tasks: binary and multi-class classification.
In binary classification, the model achieved near-perfect accuracy, while in multi-class classification, it maintained a high accuracy level (99.04%), effectively categorizing different attack types with a loss value of 0.0220%.
arXiv Detail & Related papers (2025-01-21T20:52:23Z)
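
A hedged sketch of the kind of hybrid CNN + LSTM + attention classifier named in the intrusion-detection entry above; the layer sizes, ordering, and pooling are assumptions rather than the paper's configuration.

```python
# Rough sketch of a hybrid CNN + LSTM + attention classifier
# (illustrative only; not the paper's architecture or hyper-parameters).
import torch
import torch.nn as nn


class CnnLstmAttnClassifier(nn.Module):
    def __init__(self, feat_dim=64, num_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(128, 128, batch_first=True)
        self.attn = nn.Linear(128, 1)          # attention scores over time
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)   # (B, T, 128)
        h, _ = self.lstm(h)                                # (B, T, 128)
        w = torch.softmax(self.attn(h), dim=1)             # (B, T, 1)
        pooled = (w * h).sum(dim=1)                        # attention pooling
        return self.head(pooled)                           # class logits


model = CnnLstmAttnClassifier()
print(model(torch.randn(4, 100, 64)).shape)    # torch.Size([4, 10])
```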
- MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization [49.00754561435518]
MSRS achieves competitive results in VSR and AVSR with 21.1% and 0.9% WER on the LRS3 benchmark, while reducing training time by at least 2x.
We explore other sparse approaches and show that only MSRS enables training from scratch by implicitly masking the weights affected by vanishing gradients.
arXiv Detail & Related papers (2024-06-25T15:00:43Z)
- Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
- Model-based Deep Learning Receiver Design for Rate-Splitting Multiple Access [65.21117658030235]
This work proposes a novel design for a practical RSMA receiver based on model-based deep learning (MBDL) methods.
The MBDL receiver is evaluated in terms of uncoded Symbol Error Rate (SER), throughput performance through Link-Level Simulations (LLS) and average training overhead.
Results reveal that the MBDL receiver outperforms the SIC receiver with imperfect CSIR by a significant margin.
arXiv Detail & Related papers (2022-05-02T12:23:55Z)
- Conformer-based Hybrid ASR System for Switchboard Dataset [99.88988282353206]
We present and evaluate a competitive conformer-based hybrid model training recipe.
We study different training aspects and methods to improve the word error rate as well as to increase training speed.
We conduct experiments on the 300-hour Switchboard dataset, and our conformer-based hybrid model achieves competitive results.
arXiv Detail & Related papers (2021-11-05T12:03:18Z)
- Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization [21.216783537997426]
We propose an architecture that is able to better leverage the acoustic features provided by PANNs for the Automated Audio Captioning Task.
We also introduce a novel self-supervised objective, Reconstruction Latent Space Similarity Regularization (RLSSR).
arXiv Detail & Related papers (2021-08-10T13:49:41Z)
- Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers [54.47911829539919]
We develop a novel top-down training method which can be viewed as an algorithm for searching for high-quality classifiers.
We tested this method on automatic speech recognition (ASR) tasks and language modelling tasks.
The proposed method consistently improves recurrent neural network ASR models on Wall Street Journal, self-attention ASR models on Switchboard, and AWD-LSTM language models on WikiText-2.
arXiv Detail & Related papers (2021-02-09T08:19:49Z)
- Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs).
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
arXiv Detail & Related papers (2020-07-17T08:32:11Z)
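
The DARTS-style selection mentioned in the NAS entry above amounts to a softmax-weighted mixture over candidate operations whose mixing weights are trained jointly with the network and later discretized; the sketch below is a generic illustration with assumed candidate operations, not the paper's factored TDNN search space.

```python
# Generic sketch of DARTS-style differentiable architecture selection
# (illustrative only; candidate ops and dimensions are assumptions).
import torch
import torch.nn as nn


class DartsMixedOp(nn.Module):
    """Softmax-weighted mixture of candidate ops; keep the argmax after search."""

    def __init__(self, dim):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.Identity(),                       # skip connection
            nn.Linear(dim, dim),                 # single linear transform
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)),
        ])
        # Architecture parameters (alpha), trained alongside the weights.
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)     # selection probabilities
        return sum(wi * op(x) for wi, op in zip(w, self.candidates))

    def discretize(self):
        """Return the single op chosen at the end of the search."""
        return self.candidates[int(self.alpha.argmax())]


layer = DartsMixedOp(dim=64)
print(layer(torch.randn(10, 64)).shape, layer.discretize())
```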
- High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model [46.34788932277904]
We improve conventional hybrid LSTM acoustic models for high-accuracy and low-latency automatic speech recognition.
To achieve high accuracy, we use a contextual layer trajectory LSTM (cltLSTM), which decouples the temporal modeling and target classification tasks.
We further improve the training strategy with sequence-level teacher-student learning.
arXiv Detail & Related papers (2020-03-17T00:52:11Z)
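
As a simplified illustration of the teacher-student idea in the last entry (reduced here to frame-level KL distillation; the paper's method is sequence-level, and all names below are assumptions):

```python
# Simplified frame-level teacher-student distillation for a hybrid
# acoustic model (a sketch; the paper uses sequence-level teacher-student
# learning operating on sequence/lattice posteriors).
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student senone posteriors."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # "batchmean" gives the mean KL per frame in the batch.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")


# Toy usage: (batch * frames, num_senones) logits from both models.
student = torch.randn(160, 3000, requires_grad=True)
teacher = torch.randn(160, 3000)
loss = distillation_loss(student, teacher)
loss.backward()
print(float(loss))
```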