Prediction of speech intelligibility with DNN-based performance measures
- URL: http://arxiv.org/abs/2203.09148v1
- Date: Thu, 17 Mar 2022 08:05:38 GMT
- Title: Prediction of speech intelligibility with DNN-based performance measures
- Authors: Angel Mario Castro Martinez, Constantin Spille, Jana Roßbach,
Birger Kollmeier, Bernd T. Meyer
- Abstract summary: This paper presents a speech intelligibility model based on automatic speech recognition (ASR).
It combines phoneme probabilities from deep neural networks (DNN) and a performance measure that estimates the word error rate from these probabilities.
The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models.
- Score: 9.883633991083789
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a speech intelligibility model based on automatic speech
recognition (ASR), combining phoneme probabilities from deep neural networks
(DNN) and a performance measure that estimates the word error rate from these
probabilities. This model does not require the clean speech reference nor the
word labels during testing as the ASR decoding step, which finds the most
likely sequence of words given phoneme posterior probabilities, is omitted. The
model is evaluated via the root-mean-squared error between the predicted and
observed speech reception thresholds from eight normal-hearing listeners. The
recognition task consists of identifying noisy words from a German matrix
sentence test. The speech material was mixed with eight noise maskers covering
different modulation types, from speech-shaped stationary noise to a
single-talker masker. The prediction performance is compared to five
established models and an ASR-model using word labels. Two combinations of
features and networks were tested. Both include temporal information either at
the feature level (amplitude modulation filterbanks and a feed-forward network)
or captured by the architecture (mel-spectrograms and a time-delay deep neural
network, TDNN). The TDNN model is on par with the DNN while reducing the number
of parameters by a factor of 37; this optimization allows parallel streams on
dedicated hearing aid hardware as a forward-pass can be computed within the
10 ms of each frame. The proposed model performs almost as well as the
label-based model and produces more accurate predictions than the baseline
models.
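The abstract describes combining DNN phoneme posteriors with a performance measure that estimates word error rate, and evaluating predictions via the RMSE between predicted and observed speech reception thresholds (SRTs). The paper's actual performance measure is not reproduced here; as a simplified sketch (an assumption, not the authors' method), one could use the mean per-frame entropy of the phoneme posteriors as an uncertainty proxy, alongside the RMSE evaluation metric:

```python
import numpy as np

def mean_posterior_entropy(posteriors):
    """Average per-frame entropy of a (frames x phonemes) posterior matrix.

    Higher entropy means a less confident ASR front-end, which would
    correspond to lower predicted intelligibility. This is an illustrative
    proxy, not the measure used in the paper.
    """
    p = np.clip(posteriors, 1e-12, 1.0)          # avoid log(0)
    frame_entropy = -np.sum(p * np.log(p), axis=1)
    return float(frame_entropy.mean())

def rmse(predicted_srt, observed_srt):
    """Root-mean-squared error between predicted and observed SRTs (dB SNR),
    the evaluation metric named in the abstract."""
    predicted = np.asarray(predicted_srt, dtype=float)
    observed = np.asarray(observed_srt, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))
```

For a posterior matrix over P phonemes, uniform rows give the maximal entropy ln(P), while near-one-hot rows (a confident network) give entropy close to zero; mapping such a measure to an SRT would require a calibration step that is outside this sketch.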
Related papers
- HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Visualising and Explaining Deep Learning Models for Speech Quality
Prediction [0.0]
The non-intrusive speech quality prediction model NISQA is analyzed in this paper.
It is composed of a convolutional neural network (CNN) and a recurrent neural network (RNN).
arXiv Detail & Related papers (2021-12-12T12:50:03Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time-stamping method achieves less than 50 ms average word-timing difference.
arXiv Detail & Related papers (2021-04-27T23:31:43Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and
Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual
Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
- Effects of Number of Filters of Convolutional Layers on Speech
Recognition Model Accuracy [6.2698513174194215]
This paper studies the effect of the number of filters in convolutional layers on the prediction accuracy of CNN+RNN (convolutional plus recurrent network) models for automatic speech recognition (ASR).
Experimental results show that only when the CNN Number of Filters exceeds a certain threshold value is adding CNN to RNN able to improve the performance of the CNN+RNN speech recognition model.
arXiv Detail & Related papers (2021-02-03T23:04:38Z)
- Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints.
Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT can achieve better accuracy compared with PIT.
arXiv Detail & Related papers (2020-11-26T06:28:04Z)
- DNN-Based Semantic Model for Rescoring N-best Speech Recognition List [8.934497552812012]
The word error rate (WER) of an automatic speech recognition (ASR) system increases when a mismatch occurs between the training and the testing conditions due to the noise, etc.
This work aims to improve ASR by modeling long-term semantic relations to compensate for distorted acoustic features.
arXiv Detail & Related papers (2020-11-02T13:50:59Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell for multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 back-bones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.