The Performance Evaluation of Attention-Based Neural ASR under Mixed
Speech Input
- URL: http://arxiv.org/abs/2108.01245v1
- Date: Tue, 3 Aug 2021 02:08:22 GMT
- Title: The Performance Evaluation of Attention-Based Neural ASR under Mixed
Speech Input
- Authors: Bradley He, Martin Radfar
- Abstract summary: We present mixtures of speech signals to a popular attention-based neural ASR known as Listen, Attend, and Spell (LAS).
In particular, we investigate in detail which phoneme is predicted when two phonemes are mixed.
Our results show that the model, when presented with mixed phoneme signals, tends to predict those that have higher accuracies.
- Score: 1.776746672434207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In order to evaluate the performance of attention-based neural ASR under
noisy conditions, the current trend is to present hours of varied noisy speech
data to the model and measure the overall word/phoneme error rate (W/PER). In
general, it is unclear how these models perform when exposed to a cocktail
party setup in which two or more speakers are active. In this paper, we present
mixtures of speech signals to a popular attention-based neural ASR, known as
Listen, Attend, and Spell (LAS), at different target-to-interference ratios
(TIRs) and measure the phoneme error rate. In particular, we investigate in
detail which phoneme is predicted when two phonemes are mixed; in this fashion,
we build a model that gives the most probable predictions for each phoneme. We
found a 65% relative increase in PER when LAS was presented with mixed speech
signals at TIR = 0 dB, and the performance approaches the unmixed scenario at
TIR = 30 dB. Our results show that the model, when presented with mixed phoneme
signals, tends to predict those that have higher accuracies during evaluation
of the original phoneme signals.
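To make the evaluation protocol above concrete, here is a minimal Python sketch of the two ingredients it relies on: mixing a target and an interfering utterance at a chosen TIR, and scoring a phoneme error rate. The function names, the toy 16 kHz signals, and all variable names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def mix_at_tir(target: np.ndarray, interference: np.ndarray, tir_db: float) -> np.ndarray:
    """Scale the interferer so that 10*log10(P_target / P_interferer) equals
    tir_db, then add it to the target (both truncated to a common length)."""
    n = min(len(target), len(interference))
    target, interference = target[:n], interference[:n]
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interference ** 2)
    gain = np.sqrt(p_target / (p_interf * 10.0 ** (tir_db / 10.0)))
    return target + gain * interference

def phoneme_error_rate(ref, hyp):
    """PER = (substitutions + deletions + insertions) / len(ref), via the
    standard Levenshtein dynamic program over phoneme sequences."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution or match
    return d[len(ref), len(hyp)] / len(ref)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal(16000)      # stand-ins for real utterances
    interferer = rng.standard_normal(16000)
    mixed = mix_at_tir(target, interferer, tir_db=0.0)  # equal power at 0 dB
    print(phoneme_error_rate(["AH", "B", "K"], ["AH", "K"]))  # -> 0.333...
```

Since the interferer gain scales as 10^(-TIR/20), at TIR = 30 dB it is about 31.6 times smaller than at 0 dB, which is consistent with the paper's observation that performance there approaches the unmixed case.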
Related papers
- AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications.
We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z)
- Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks [24.331098975217596]
We propose a model based on anti-symmetric twin neural networks, trained on pairs of waveforms and their corresponding preference scores.
To obtain a large training set, we convert listeners' ratings from MUSHRA tests to values that reflect how often one stimulus in the pair was rated higher than the other (a sketch of this conversion appears after this list).
Our results compare favourably to a state-of-the-art model trained to predict MOS scores.
arXiv Detail & Related papers (2022-09-22T13:34:22Z)
- MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids [22.736703635666164]
We propose a multi-branched speech intelligibility prediction model (MBI-Net) for predicting subjective intelligibility scores of hearing aid (HA) users.
The outputs of the two branches are fused through a linear layer to obtain predicted speech intelligibility scores.
arXiv Detail & Related papers (2022-04-07T09:13:44Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Prediction of speech intelligibility with DNN-based performance measures [9.883633991083789]
This paper presents a speech intelligibility model based on automatic speech recognition (ASR).
It combines phoneme probabilities from deep neural networks (DNN) and a performance measure that estimates the word error rate from these probabilities.
The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models.
arXiv Detail & Related papers (2022-03-17T08:05:38Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization [113.19483349876668]
This paper proposes a novel approach to sound event localization by utilizing an attention-based sequence-to-sequence model.
It yields superior localization performance compared to state-of-the-art methods in both anechoic and reverberant conditions.
arXiv Detail & Related papers (2021-02-28T07:52:20Z)
- Extracting the Locus of Attention at a Cocktail Party from Single-Trial EEG using a Joint CNN-LSTM Model [0.1529342790344802]
The human brain performs remarkably well in segregating a particular speaker from interfering speakers in a multi-speaker scenario.
We present a joint convolutional neural network (CNN) - long short-term memory (LSTM) model to infer the auditory attention.
arXiv Detail & Related papers (2021-02-08T01:06:48Z)
- DNN-Based Semantic Model for Rescoring N-best Speech Recognition List [8.934497552812012]
The word error rate (WER) of an automatic speech recognition (ASR) system increases when a mismatch occurs between the training and testing conditions, for example due to noise.
This work aims to improve ASR by modeling long-term semantic relations to compensate for distorted acoustic features (a sketch of this rescoring recipe appears after this list).
arXiv Detail & Related papers (2020-11-02T13:50:59Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_At$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
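For the pairwise-preference entry above, here is a minimal sketch of the kind of ratings-to-preference conversion it describes: for each pair of stimuli, count how often listeners rated one above the other. The data layout, tie handling, and function name are assumptions for illustration, not the paper's actual code.

```python
from itertools import combinations

def pairwise_preferences(ratings):
    """ratings: {stimulus_id: [score per listener]}, aligned across listeners
    (e.g. scores from one MUSHRA screen). Returns {(a, b): fraction of
    listeners who rated a above b, with ties counted as 0.5}."""
    prefs = {}
    for a, b in combinations(sorted(ratings), 2):
        wins = sum(1.0 if ra > rb else 0.5 if ra == rb else 0.0
                   for ra, rb in zip(ratings[a], ratings[b]))
        prefs[(a, b)] = wins / len(ratings[a])
    return prefs

# Toy MUSHRA screen: three systems rated by four listeners.
scores = {"sysA": [80, 75, 90, 60], "sysB": [70, 75, 85, 65], "sysC": [40, 50, 55, 45]}
print(pairwise_preferences(scores))
# e.g. ('sysA', 'sysB') -> 0.625: A beat B for two listeners, tied once, lost once.
```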
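And for the N-best rescoring entry, a minimal sketch of the general recipe it builds on: linearly interpolate the first-pass ASR score with a semantic score and re-rank. The interpolation weight and the toy dictionary scorer are illustrative assumptions; the paper's actual semantic model is a DNN.

```python
def rescore_nbest(nbest, semantic_score, lam=0.5):
    """nbest: list of (hypothesis, first-pass score) pairs. Re-rank by
    interpolating the first-pass (acoustic + LM) score with a semantic model
    score and return the top hypothesis. `lam` is an illustrative weight."""
    return max(nbest, key=lambda h: (1 - lam) * h[1] + lam * semantic_score(h[0]))[0]

# Toy example: a stand-in semantic scorer (a dict here; a DNN in the paper)
# promotes the semantically plausible hypothesis over the acoustically best one.
hyps = [("the cat sat", -12.3), ("the cats at", -12.1), ("the cat sad", -12.8)]
toy_semantic = {"the cat sat": -1.0, "the cats at": -4.0, "the cat sad": -2.5}
print(rescore_nbest(hyps, toy_semantic.get))  # -> "the cat sat"
```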