Investigating the Sensitivity of Automatic Speech Recognition Systems to
Phonetic Variation in L2 Englishes
- URL: http://arxiv.org/abs/2305.07389v1
- Date: Fri, 12 May 2023 11:29:13 GMT
- Title: Investigating the Sensitivity of Automatic Speech Recognition Systems to
Phonetic Variation in L2 Englishes
- Authors: Emma O'Neill and Julie Carson-Berndsen
- Abstract summary: This work demonstrates a method of probing an ASR system to discover how it handles phonetic variation across a number of L2 Englishes.
It is demonstrated that the behaviour of the ASR is systematic and consistent across speakers with similar spoken varieties.
- Score: 3.198144010381572
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Speech Recognition (ASR) systems exhibit the best performance on
speech that is similar to that on which they were trained. As such,
underrepresented varieties, including regional dialects, minority speakers, and
low-resource languages, see much higher word error rates (WERs) than those
varieties seen as 'prestigious', 'mainstream', or 'standard'. This can act as a
barrier to incorporating ASR technology into the annotation process for
large-scale linguistic research since the manual correction of the erroneous
automated transcripts can be just as time- and resource-consuming as manual
transcriptions. A deeper understanding of the behaviour of an ASR system is
thus beneficial from a speech technology standpoint, in terms of improving ASR
accuracy, and from an annotation standpoint, where knowing the likely errors
made by an ASR system can aid in this manual correction. This work demonstrates
a method of probing an ASR system to discover how it handles phonetic variation
across a number of L2 Englishes. Specifically, it examines how particular phonetic
realisations which were rare or absent in the system's training data can lead
to phoneme-level misrecognitions and contribute to higher WERs. It is
demonstrated that the behaviour of the ASR is systematic and consistent across
speakers with similar spoken varieties (in this case the same L1) and phoneme
substitution errors are typically in agreement with human annotators. By
identifying problematic productions, specific weaknesses can be addressed by
sourcing such realisations for training and fine-tuning, thus making the system
more robust to pronunciation variation.
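The abstract refers to two quantities: word error rate and phoneme-level substitution errors. The sketch below shows one common way to compute both from a Levenshtein alignment between a reference and a hypothesis sequence. It is not the authors' procedure. The function names (`align`, `error_rate`, `substitution_counts`) and the toy phoneme sequences are assumptions made up for illustration, and the ARPAbet-style symbols are only placeholders.

```python
from collections import Counter


def align(ref, hyp):
    """Levenshtein-align two token lists.

    Returns a list of (op, ref_token, hyp_token) tuples where op is one of
    "ok", "sub", "del", "ins".
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def error_rate(ref, hyp):
    """(substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    errors = sum(op != "ok" for op, _, _ in align(ref, hyp))
    return errors / len(ref)


def substitution_counts(pairs):
    """Count (reference, hypothesis) substitution pairs over many utterances."""
    counts = Counter()
    for ref, hyp in pairs:
        for op, r, h in align(ref, hyp):
            if op == "sub":
                counts[(r, h)] += 1
    return counts


# Toy data, invented for illustration only.
wer = error_rate("the thin man thinks".split(), "the tin man sinks".split())
subs = substitution_counts([
    (["TH", "IH", "N"], ["T", "IH", "N"]),
    (["TH", "IH", "NG", "K"], ["T", "IH", "NG", "K"]),
])
print(wer)                  # 0.5
print(subs.most_common(1))  # [(('TH', 'T'), 2)]
```

Aggregating substitution pairs per speaker group (for example per L1) would be one way to check whether an error pattern is systematic, though the grouping and data used in the paper are not specified here.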
Related papers
- Quantification of stylistic differences in human- and ASR-produced transcripts of African American English [1.8021379035665333]
Stylistic differences, such as verbatim vs non-verbatim, can play a significant role in ASR performance evaluation.
We categorize the kinds of stylistic differences between 6 transcription versions, 4 human- and 2 ASR-produced, of 10 hours of African American English speech.
We investigate the interactions of these categories with how well transcripts can be compared via word error rate.
arXiv Detail & Related papers (2024-09-04T20:18:59Z)
- Speaker Tagging Correction With Non-Autoregressive Language Models [0.0]
We propose a speaker tagging correction system based on a non-autoregressive language model.
We show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets.
arXiv Detail & Related papers (2024-08-30T11:02:17Z)
- Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs).
To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods.
Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even those tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Cross-lingual Knowledge Transfer and Iterative Pseudo-labeling for Low-Resource Speech Recognition with Transducers [6.017182111335404]
Cross-lingual knowledge transfer and iterative pseudo-labeling are two techniques that have been shown to be successful for improving the accuracy of ASR systems.
We show that the Transducer system trained using transcripts produced by the hybrid system achieves an 18% reduction in word error rate.
arXiv Detail & Related papers (2023-05-23T03:50:35Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique is able to reduce the WER by more than 6% relative and the average last token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- Improving Distinction between ASR Errors and Speech Disfluencies with Feature Space Interpolation [0.0]
Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing.
This paper proposes a scheme to improve existing LM-based ASR error detection systems.
arXiv Detail & Related papers (2021-08-04T02:11:37Z)
- Hallucination of speech recognition errors with sequence to sequence learning [16.39332236910586]
When plain text data is to be used to train systems for spoken language understanding or ASR, a proven strategy is to hallucinate what the ASR outputs would be given a gold transcription.
We present novel end-to-end models to directly predict hallucinated ASR word sequence outputs, conditioning on an input word sequence as well as a corresponding phoneme sequence.
This improves prior published results for recall of errors from an in-domain ASR system's transcription of unseen data, as well as an out-of-domain ASR system's transcriptions of audio from an unrelated task.
arXiv Detail & Related papers (2021-03-23T02:09:39Z)
- Knowledge Distillation for Improved Accuracy in Spoken Question Answering [63.72278693825945]
We devise a training strategy to perform knowledge distillation from spoken documents and written counterparts.
Our work makes a step towards distilling knowledge from the language model as a supervision signal.
Experiments demonstrate that our approach outperforms several state-of-the-art language models on the Spoken-SQuAD dataset.
arXiv Detail & Related papers (2020-10-21T15:18:01Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.