Word Error Rate Estimation Without ASR Output: e-WER2
- URL: http://arxiv.org/abs/2008.03403v1
- Date: Sat, 8 Aug 2020 00:19:09 GMT
- Title: Word Error Rate Estimation Without ASR Output: e-WER2
- Authors: Ahmed Ali and Steve Renals
- Abstract summary: We use a multistream end-to-end architecture to estimate the word error rate (WER) of speech recognition systems.
We report results for systems using internal speech decoder features (glass-box), systems without speech decoder features (black-box), and systems without access to the ASR system (no-box).
Considering WER per sentence, our no-box system achieves 0.56 Pearson correlation with the reference evaluation and 0.24 root mean square error (RMSE) across 1,400 sentences.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Measuring the performance of automatic speech recognition (ASR) systems
requires manually transcribed data in order to compute the word error rate
(WER), which is often time-consuming and expensive. In this paper, we continue
our effort in estimating WER using acoustic, lexical and phonotactic features.
Our novel approach to estimate the WER uses a multistream end-to-end
architecture. We report results for systems using internal speech decoder
features (glass-box), systems without speech decoder features (black-box), and
for systems without having access to the ASR system (no-box). The no-box system
learns joint acoustic-lexical representation from phoneme recognition results
along with MFCC acoustic features to estimate WER. Considering WER per
sentence, our no-box system achieves 0.56 Pearson correlation with the
reference evaluation and 0.24 root mean square error (RMSE) across 1,400
sentences. The estimated overall WER by e-WER2 is 30.9% for a three hours test
set, while the WER computed using the reference transcriptions was 28.5%.
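The abstract's evaluation rests on three quantities: per-sentence WER (edit distance over word tokens), Pearson correlation between estimated and reference WER, and RMSE. A minimal sketch of how these are computed, assuming whitespace-tokenized transcripts (function names are illustrative, not from the paper's code):

```python
import math

def wer(ref, hyp):
    """Word error rate: Levenshtein distance over word tokens,
    normalized by the reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / max(len(r), 1)

def pearson(x, y):
    """Pearson correlation between estimated and reference WER lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    """Root mean square error between the two lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```

In this framing, the paper's 0.56 Pearson / 0.24 RMSE would correspond to calling `pearson` and `rmse` on the 1,400 per-sentence estimates versus the reference `wer` values.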
Related papers
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems [17.160006765475988]
We propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) model.
We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model.
This results in a single E2E model that can be used during inference to perform frame filtering at low cost.
arXiv Detail & Related papers (2022-11-01T23:43:15Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a wav2vec pre-trained model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- Cross-Modal ASR Post-Processing System for Error Correction and Utterance Rejection [25.940199825317073]
We propose a cross-modal post-processing system for speech recognizers.
It fuses acoustic features and textual features from different modalities.
It jointly trains a confidence estimator and an error corrector in a multi-task learning fashion.
arXiv Detail & Related papers (2022-01-10T12:29:55Z)
- Low Resource German ASR with Untranscribed Data Spoken by Non-native Children -- INTERSPEECH 2021 Shared Task SPAPL System [19.435571932141364]
This paper describes the SPAPL system for the INTERSPEECH 2021 Challenge: Shared Task on Automatic Speech Recognition for Non-Native Children's Speech in German.
5 hours of transcribed data and 60 hours of untranscribed data are provided to develop a German ASR system for children.
For the training of the transcribed data, we propose a non-speech state discriminative loss (NSDL) to mitigate the influence of long-duration non-speech segments within speech utterances.
Our system achieves a word error rate (WER) of 39.68% on the evaluation data.
arXiv Detail & Related papers (2021-06-18T07:36:26Z)
- On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era [0.0]
We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
We achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP.
arXiv Detail & Related papers (2021-04-20T17:10:01Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
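The BPE-dropout technique referenced above regularizes subword segmentation by randomly skipping merges during encoding, so the same word yields varied token sequences at training time. A toy sketch of the idea (not the paper's implementation; the merge table and function name are illustrative):

```python
import random

def bpe_dropout_segment(word, merges, p=0.1, rng=None):
    """Toy BPE-dropout: greedy BPE merging where each candidate merge
    is skipped with probability p. With p=0 this reduces to plain
    greedy BPE; with p>0 one word yields varied segmentations."""
    rng = rng or random.Random()
    rank = {m: i for i, m in enumerate(merges)}  # earlier merge = higher priority
    toks = list(word)                            # start from characters
    while True:
        best, best_i = None, None
        for i in range(len(toks) - 1):
            pair = (toks[i], toks[i + 1])
            if pair not in rank:
                continue
            if rng.random() < p:                 # dropout: randomly skip this merge
                continue
            if best is None or rank[pair] < rank[best]:
                best, best_i = pair, i
        if best is None:                         # no merge applied this round
            break
        toks[best_i:best_i + 2] = [toks[best_i] + toks[best_i + 1]]
    return toks
```

For example, with merges `[("l", "o"), ("lo", "w")]`, `p=0` always yields `["low"]`, while `p=1` always falls back to characters; intermediate values of p sample between these extremes.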
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.