English Accent Accuracy Analysis in a State-of-the-Art Automatic Speech
Recognition System
- URL: http://arxiv.org/abs/2105.05041v1
- Date: Sun, 9 May 2021 08:24:33 GMT
- Title: English Accent Accuracy Analysis in a State-of-the-Art Automatic Speech
Recognition System
- Authors: Guillermo Cámbara, Alex Peiró-Lilja, Mireia Farrús, Jordi Luque
- Abstract summary: We evaluate a state-of-the-art automatic speech recognition model, using unseen data from a corpus with a wide variety of labeled English accents.
We show that there is indeed an accuracy bias in terms of accentual variety, favoring the accents most prevalent in the training corpus.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, research in speech technologies has advanced considerably
thanks to recently released public-domain corpora that contain thousands of
recording hours. These large amounts of data are very helpful for training the
new, complex models based on deep learning. However, a lack of dialectal
diversity in a corpus is known to cause performance biases in speech systems,
mainly for underrepresented dialects. In this work, we propose to
evaluate a state-of-the-art automatic speech recognition (ASR) deep
learning-based model, using unseen data from a corpus with a wide variety of
labeled English accents from different countries around the world. The model
was trained on 44.5K hours of English speech from an open-access corpus called
Multilingual LibriSpeech, achieving remarkable results on popular benchmarks.
We test the accuracy of this ASR system against samples drawn from another
continuously growing public corpus, the Common Voice dataset.
We then graphically present the accuracy, in terms of Word Error Rate (WER),
for each of the English accents included, showing that there is indeed an
accuracy bias across accentual varieties, favoring the accents most prevalent
in the training corpus.
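The per-accent evaluation described in the abstract boils down to computing WER over utterances grouped by their accent label. A minimal sketch in plain Python (the `wer_by_accent` helper and the `(accent, reference, hypothesis)` tuple format are illustrative assumptions, not details from the paper):

```python
from collections import defaultdict

def word_edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    d = list(range(len(hyp_words) + 1))
    for i in range(1, len(ref_words) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            # min over deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + cost)
    return d[-1]

def wer_by_accent(samples):
    """Corpus-level WER per accent label.

    `samples` is an iterable of (accent, reference, hypothesis) tuples,
    e.g. rows drawn from a labeled accent corpus such as Common Voice.
    """
    errors, words = defaultdict(int), defaultdict(int)
    for accent, ref, hyp in samples:
        r, h = ref.split(), hyp.split()
        errors[accent] += word_edit_distance(r, h)
        words[accent] += len(r)
    return {a: errors[a] / words[a] for a in errors if words[a]}
```

Summing edit counts and reference words per group before dividing gives a corpus-level WER, which weights long utterances more than averaging per-utterance WERs would.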
Related papers
- Exploring Speech Recognition, Translation, and Understanding with
Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- Improving Speech Recognition for African American English With Audio Classification [17.785482810741367]
We propose a new way to improve the robustness of a US English short-form speech recognizer using a small amount of out-of-domain data.
Fine-tuning on this data yields a 38.5% relative reduction in the word error rate disparity between African American English (AAE) and Mainstream American English (MAE), without reducing MAE quality.
arXiv Detail & Related papers (2023-09-16T19:57:45Z)
- A Deep Dive into the Disparity of Word Error Rates Across Thousands of NPTEL MOOC Videos [4.809236881780707]
We describe the curation of a massive speech dataset of 8740 hours, consisting of ~9.8K technical lectures in the English language along with their transcripts, delivered by instructors representing various parts of Indian demography.
We use the curated dataset to measure the existing disparity in YouTube Automatic Captions and OpenAI Whisper model performance across the diverse demographic traits of speakers in India.
arXiv Detail & Related papers (2023-07-20T05:03:00Z)
- Some voices are too common: Building fair speech recognition systems using the Common Voice dataset [2.28438857884398]
We use the French Common Voice dataset to quantify the biases of a pre-trained wav2vec2.0 model toward several demographic groups.
We also run an in-depth analysis of the Common Voice corpus and identify important shortcomings that should be taken into account.
arXiv Detail & Related papers (2023-06-01T11:42:34Z)
- Effects of language mismatch in automatic forensic voice comparison using deep learning embeddings [0.0]
This study aims to investigate whether a model pre-trained on English corpus can be used on a target low-resource language.
It was found that the model pre-trained on a different language but on a corpus with a huge amount of speakers performs well on samples with language mismatch.
arXiv Detail & Related papers (2022-09-26T11:49:37Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- A Highly Adaptive Acoustic Model for Accurate Multi-Dialect Speech Recognition [80.87085897419982]
We propose a novel acoustic modeling technique for accurate multi-dialect speech recognition with a single AM.
Our proposed AM is dynamically adapted based on both dialect information and its internal representation, which results in a highly adaptive AM for handling multiple dialects simultaneously.
The experimental results on large-scale speech datasets show that the proposed AM outperforms all previous ones, reducing word error rates (WERs) by 8.11% relative compared to a single all-dialects AM and by 7.31% relative compared to dialect-specific AMs.
arXiv Detail & Related papers (2022-05-06T06:07:09Z)
- Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents [0.0]
We use a transfer learning approach to develop an end-to-end speech recognition system for Indian-English accents.
Indic TTS data of Indian-English accents is used for transfer learning and fine-tuning the pre-trained Deep Speech model.
arXiv Detail & Related papers (2022-04-03T03:11:21Z)
- Automatic Dialect Density Estimation for African American English [74.44807604000967]
We explore automatic prediction of dialect density of the African American English (AAE) dialect.
Dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect.
We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
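The dialect density metric defined in this summary reduces to a simple ratio. A minimal sketch, where the `is_dialect_marker` predicate is a hypothetical stand-in for whatever feature detector the paper actually uses:

```python
def dialect_density(words, is_dialect_marker):
    """Fraction of words in an utterance flagged as carrying
    characteristics of the non-standard dialect."""
    if not words:
        return 0.0
    return sum(1 for w in words if is_dialect_marker(w)) / len(words)
```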
arXiv Detail & Related papers (2022-04-03T01:34:48Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.