Related papers: Investigating the Sensitivity of Automatic Speech Recognition Systems to Phonetic Variation in L2 Englishes

Investigating the Sensitivity of Automatic Speech Recognition Systems to Phonetic Variation in L2 Englishes

URL: http://arxiv.org/abs/2305.07389v1
Date: Fri, 12 May 2023 11:29:13 GMT
Title: Investigating the Sensitivity of Automatic Speech Recognition Systems to Phonetic Variation in L2 Englishes
Authors: Emma O'Neill and Julie Carson-Berndsen
Abstract summary: This work demonstrates a method of probing an ASR system to discover how it handles phonetic variation across a number of L2 Englishes. It is demonstrated that the behaviour of the ASR is systematic and consistent across speakers with similar spoken varieties.
Score: 3.198144010381572
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automatic Speech Recognition (ASR) systems exhibit the best performance on speech that is similar to that on which it was trained. As such, underrepresented varieties including regional dialects, minority-speakers, and low-resource languages, see much higher word error rates (WERs) than those varieties seen as 'prestigious', 'mainstream', or 'standard'. This can act as a barrier to incorporating ASR technology into the annotation process for large-scale linguistic research since the manual correction of the erroneous automated transcripts can be just as time and resource consuming as manual transcriptions. A deeper understanding of the behaviour of an ASR system is thus beneficial from a speech technology standpoint, in terms of improving ASR accuracy, and from an annotation standpoint, where knowing the likely errors made by an ASR system can aid in this manual correction. This work demonstrates a method of probing an ASR system to discover how it handles phonetic variation across a number of L2 Englishes. Specifically, how particular phonetic realisations which were rare or absent in the system's training data can lead to phoneme level misrecognitions and contribute to higher WERs. It is demonstrated that the behaviour of the ASR is systematic and consistent across speakers with similar spoken varieties (in this case the same L1) and phoneme substitution errors are typically in agreement with human annotators. By identifying problematic productions specific weaknesses can be addressed by sourcing such realisations for training and fine-tuning thus making the system more robust to pronunciation variation.

Related papers

Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling [0.0]
This study assesses five cutting-edge ASR systems' recognition of non-native English accented speech using recordings from the L2-ARCTIC corpus. For read speech, Whisper and AssemblyAI achieved the best accuracy with mean Match Error Rates (MER) of 0.054 and 0.056 respectively. For spontaneous speech, RevAI performed best with a mean MER of 0.063.
arXiv Detail & Related papers (2025-03-10T05:09:44Z)
Quantification of stylistic differences in human- and ASR-produced transcripts of African American English [1.8021379035665333]
Stylistic differences, such as verbatim vs non-verbatim, can play a significant role in ASR performance evaluation. We categorize the kinds of stylistic differences between 6 transcription versions, 4 human- and 2 ASR-produced, of 10 hours of African American English speech. We investigate the interactions of these categories with how well transcripts can be compared via word error rate.
arXiv Detail & Related papers (2024-09-04T20:18:59Z)
Speaker Tagging Correction With Non-Autoregressive Language Models [0.0]
We propose a speaker tagging correction system based on a non-autoregressive language model. We show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets.
arXiv Detail & Related papers (2024-08-30T11:02:17Z)
Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs) To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods. Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z)
Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems. We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems. We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses. LLMs with reasonable prompt and its generative capability can even correct those tokens that are missing in N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
Cross-lingual Knowledge Transfer and Iterative Pseudo-labeling for Low-Resource Speech Recognition with Transducers [6.017182111335404]
Cross-lingual knowledge transfer and iterative pseudo-labeling are two techniques that have been shown to be successful for improving the accuracy of ASR systems. We show that the Transducer system trained using transcripts produced by the hybrid system achieves 18% reduction in terms of word error rate.
arXiv Detail & Related papers (2023-05-23T03:50:35Z)
Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances. Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems. The proposed technique is able to reduce both the WER and the average last token emission latency by more than 6% and 40ms relative.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
Improving Distinction between ASR Errors and Speech Disfluencies with Feature Space Interpolation [0.0]
Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing. This paper proposes a scheme to improve existing LM-based ASR error detection systems.
arXiv Detail & Related papers (2021-08-04T02:11:37Z)
Hallucination of speech recognition errors with sequence to sequence learning [16.39332236910586]
When plain text data is to be used to train systems for spoken language understanding or ASR, a proven strategy is to hallucinate what the ASR outputs would be given a gold transcription. We present novel end-to-end models to directly predict hallucinated ASR word sequence outputs, conditioning on an input word sequence as well as a corresponding phoneme sequence. This improves prior published results for recall of errors from an in-domain ASR system's transcription of unseen data, as well as an out-of-domain ASR system's transcriptions of audio from an unrelated task.
arXiv Detail & Related papers (2021-03-23T02:09:39Z)
Knowledge Distillation for Improved Accuracy in Spoken Question Answering [63.72278693825945]
We devise a training strategy to perform knowledge distillation from spoken documents and written counterparts. Our work makes a step towards distilling knowledge from the language model as a supervision signal. Experiments demonstrate that our approach outperforms several state-of-the-art language models on the Spoken-SQuAD dataset.
arXiv Detail & Related papers (2020-10-21T15:18:01Z)
Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR) APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker. We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.