Improving Character Error Rate Is Not Equal to Having Clean Speech:
Speech Enhancement for ASR Systems with Black-box Acoustic Models
- URL: http://arxiv.org/abs/2110.05968v1
- Date: Tue, 12 Oct 2021 12:51:53 GMT
- Title: Improving Character Error Rate Is Not Equal to Having Clean Speech:
Speech Enhancement for ASR Systems with Black-box Acoustic Models
- Authors: Ryosuke Sawata, Yosuke Kashiwagi and Shusuke Takahashi
- Abstract summary: This paper proposes a deep neural network (DNN)-based speech enhancement (SE) method.
Our method uses two DNNs: one for speech processing and one for mimicking the output character error rates (CERs) derived through an acoustic model (AM).
Experimental results show that our method improved the CER obtained through a black-box AM by 7.3% relative, even though some noise is retained.
- Score: 1.6328866317851185
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a deep neural network (DNN)-based speech enhancement (SE)
method that aims to maximize the performance of an automatic speech recognition
(ASR) system. The character error rate (CER), one of the metrics used to
evaluate ASR systems, is generally non-differentiable. To optimize the
DNN-based SE model in terms of the CER, our method uses two DNNs: one for
speech processing and one for mimicking the output CERs derived through an
acoustic model (AM). The two DNNs are optimized alternately in the
training phase. Even if the AM is a black box, e.g., one provided by a
third party, the proposed method enables the DNN-based SE model to be optimized
in terms of the CER because the DNN mimicking the AM is differentiable.
Consequently, it becomes feasible to build a CER-centric SE model that has no
negative effect on the inference phase, e.g., additional computational cost or
changes to the network architecture, since our method is merely a training
scheme for existing DNN-based methods. Experimental results show that our
method improved the CER obtained through a black-box AM by 7.3% relative, even
though some noise is retained.
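The alternating scheme described in the abstract can be illustrated with a toy sketch in plain Python. This is not the authors' implementation, and every name and constant in it is an assumption made for illustration: the SE model is reduced to a single hypothetical parameter `gain`, the black-box AM is a made-up function `black_box_cer` that corrupts characters of a fixed `REFERENCE` string, and the DNN that mimics the CER is replaced by a local linear surrogate fitted in closed form. What the sketch keeps from the abstract is the structure: the CER is computed from an edit distance and is piecewise constant in `gain` (so it offers no useful gradient), and training alternates between refitting a differentiable mimic of the black-box score and taking a gradient step on the SE parameter through that mimic.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    return edit_distance(ref, hyp) / len(ref)


REFERENCE = "clean speech for asr"  # hypothetical reference transcript


def black_box_cer(gain: float) -> float:
    """Hypothetical black box: only a CER score is observable, no gradients.

    The further `gain` is from 0.7 (an arbitrary toy optimum), the more
    characters of the transcript are corrupted. The score is piecewise
    constant in `gain`, so its derivative is zero almost everywhere.
    """
    n_bad = min(len(REFERENCE), round(abs(gain - 0.7) * len(REFERENCE)))
    hypothesis = "#" * n_bad + REFERENCE[n_bad:]
    return cer(REFERENCE, hypothesis)


def fit_surrogate(gain: float, delta: float = 0.2):
    """Fit q(x) = a + b * (x - gain) to black-box scores by least squares.

    Stand-in for training the CER-mimicking DNN: three probes around the
    current `gain` are scored by the black box, and a line is fitted to them.
    Returns (a, b); b is the surrogate's gradient with respect to `gain`.
    """
    offsets = (-delta, 0.0, delta)
    scores = [black_box_cer(gain + x) for x in offsets]
    a = sum(scores) / len(scores)
    b = sum(x * y for x, y in zip(offsets, scores)) / sum(x * x for x in offsets)
    return a, b


def train(gain: float = 0.0, steps: int = 60, lr: float = 0.05) -> float:
    """Alternate between updating the mimic and updating the SE parameter."""
    for _ in range(steps):
        _, slope = fit_surrogate(gain)  # (1) update mimic, SE parameter fixed
        gain -= lr * slope              # (2) update SE parameter, mimic fixed
    return gain


if __name__ == "__main__":
    final_gain = train()
    print("CER before:", black_box_cer(0.0))
    print("CER after: ", black_box_cer(final_gain))
```

In a realistic setting both components would be neural networks trained with stochastic gradient steps, and the black box would be a real recognizer scored on transcripts; the closed-form linear fit is used here only to keep the example short and deterministic.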
Related papers
- Enhancing Deep Neural Network Training Efficiency and Performance through Linear Prediction [0.0]
Deep neural networks (DNN) have achieved remarkable success in various fields, including computer vision and natural language processing.
This paper aims to propose a method to optimize the training effectiveness of DNN, with the goal of improving model performance.
arXiv Detail & Related papers (2023-10-17T03:11:30Z)
- Integrate Lattice-Free MMI into End-to-End Speech Recognition [87.01137882072322]
In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems.
With this motivation, the adoption of discriminative criteria is promising to boost the performance of end-to-end (E2E) ASR systems.
Previous works have introduced the minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems.
In this work, novel algorithms are proposed to integrate another widely used discriminative criterion, lattice-free maximum mutual information (LF-MMI), into E2E
arXiv Detail & Related papers (2022-03-29T14:32:46Z)
- A Mixture of Expert Based Deep Neural Network for Improved ASR [4.993304210475779]
MixNet is a novel deep learning architecture for acoustic models in the context of automatic speech recognition (ASR).
In natural speech, overlap in distribution across different acoustic classes is inevitable, which leads to inter-class mis-classification.
Experiments on a large-vocabulary ASR task show that the proposed architecture provides 13.6% and 10.0% relative reductions in word error rate.
arXiv Detail & Related papers (2021-12-02T07:26:34Z)
- Meta-Learning with Neural Tangent Kernels [58.06951624702086]
We propose the first meta-learning paradigm in the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK).
Within this paradigm, we introduce two meta-learning algorithms, which no longer need a sub-optimal iterative inner-loop adaptation as in the MAML framework.
We achieve this goal by 1) replacing the adaptation with a fast-adaptive regularizer in the RKHS; and 2) solving the adaptation analytically based on the NTK theory.
arXiv Detail & Related papers (2021-02-07T20:53:23Z)
- Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning [60.20150317299749]
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning.
To make full use of the training data, we propose a full data learning method for speech enhancement.
arXiv Detail & Related papers (2020-11-11T06:32:37Z)
- DNN-Based Semantic Model for Rescoring N-best Speech Recognition List [8.934497552812012]
The word error rate (WER) of an automatic speech recognition (ASR) system increases when the training and testing conditions are mismatched, for example because of noise.
This work aims to improve ASR by modeling long-term semantic relations to compensate for distorted acoustic features.
arXiv Detail & Related papers (2020-11-02T13:50:59Z)
- Attention Driven Fusion for Multi-Modal Emotion Recognition [39.295892047505816]
We present a deep learning-based approach to exploit and fuse text and acoustic data for emotion classification.
We use a SincNet layer, based on parameterized sinc functions with band-pass filters, to extract acoustic features from raw audio followed by a DCNN.
For text processing, we use two branches (a DCNN and a bidirectional RNN followed by a DCNN) in parallel, where cross attention is introduced to infer the N-gram level correlations.
arXiv Detail & Related papers (2020-09-23T08:07:58Z)
- Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs).
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
arXiv Detail & Related papers (2020-07-17T08:32:11Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and the following LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.