English Accent Accuracy Analysis in a State-of-the-Art Automatic Speech
Recognition System
- URL: http://arxiv.org/abs/2105.05041v1
- Date: Sun, 9 May 2021 08:24:33 GMT
- Title: English Accent Accuracy Analysis in a State-of-the-Art Automatic Speech
Recognition System
- Authors: Guillermo Cámbara, Alex Peiró-Lilja, Mireia Farrús, Jordi Luque
- Abstract summary: We evaluate a state-of-the-art automatic speech recognition model, using unseen data from a corpus with a wide variety of labeled English accents.
We show that there is indeed an accuracy bias in terms of accentual variety, favoring the accents most prevalent in the training corpus.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, research in speech technologies has advanced considerably
thanks to recently released public-domain corpora that contain thousands of
recording hours. These large amounts of data are very helpful for training the
new, complex models based on deep learning. However, a lack of dialectal
diversity in a corpus is known to cause performance biases in speech systems,
mainly for underrepresented dialects. In this work, we propose to
evaluate a state-of-the-art automatic speech recognition (ASR) deep
learning-based model, using unseen data from a corpus with a wide variety of
labeled English accents from different countries around the world. The model
was trained on 44.5K hours of English speech from an open-access corpus called
Multilingual LibriSpeech, achieving remarkable results on popular benchmarks.
We test the accuracy of this ASR system against samples drawn from another
continuously growing public corpus, the Common Voice dataset.
We then graphically present the accuracy, in terms of Word Error Rate (WER),
for each of the English accents included, showing that there is indeed an
accuracy bias across accentual varieties, favoring the accents most prevalent
in the training corpus.
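The per-accent evaluation described in the abstract boils down to computing WER over utterances grouped by their accent label. A minimal sketch in plain Python (the `wer_by_accent` helper and the `(accent, reference, hypothesis)` tuple format are illustrative assumptions, not details from the paper):

```python
from collections import defaultdict

def word_edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    d = list(range(len(hyp_words) + 1))
    for i in range(1, len(ref_words) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            # min over deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + cost)
    return d[-1]

def wer_by_accent(samples):
    """Corpus-level WER per accent label.

    `samples` is an iterable of (accent, reference, hypothesis) tuples,
    e.g. rows drawn from a labeled accent corpus such as Common Voice.
    """
    errors, words = defaultdict(int), defaultdict(int)
    for accent, ref, hyp in samples:
        r, h = ref.split(), hyp.split()
        errors[accent] += word_edit_distance(r, h)
        words[accent] += len(r)
    return {a: errors[a] / words[a] for a in errors if words[a]}
```

Summing edit counts and reference words per group before dividing gives a corpus-level WER, which weights long utterances more than averaging per-utterance WERs would.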
Related papers
- Exploring Speech Recognition, Translation, and Understanding with
Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- Improving Speech Recognition for African American English With Audio Classification [17.785482810741367]
We propose a new way to improve the robustness of a US English short-form speech recognizer using a small amount of out-of-domain data.
Fine-tuning on this data yields a 38.5% relative reduction in the word error rate disparity between African American English (AAE) and Mainstream American English (MAE), without reducing MAE quality.
arXiv Detail & Related papers (2023-09-16T19:57:45Z)
- A Deep Dive into the Disparity of Word Error Rates Across Thousands of NPTEL MOOC Videos [4.809236881780707]
We describe the curation of a massive speech dataset of 8740 hours, consisting of ~9.8K technical lectures in the English language along with their transcripts, delivered by instructors representing various parts of Indian demography.
We use the curated dataset to measure the existing disparity in YouTube Automatic Captions and OpenAI Whisper model performance across the diverse demographic traits of speakers in India.
arXiv Detail & Related papers (2023-07-20T05:03:00Z)
- Some voices are too common: Building fair speech recognition systems using the Common Voice dataset [2.28438857884398]
We use the French Common Voice dataset to quantify the biases of a pre-trained wav2vec2.0 model toward several demographic groups.
We also run an in-depth analysis of the Common Voice corpus and identify important shortcomings that should be taken into account.
arXiv Detail & Related papers (2023-06-01T11:42:34Z)
- Effects of language mismatch in automatic forensic voice comparison using deep learning embeddings [0.0]
This study aims to investigate whether a model pre-trained on English corpus can be used on a target low-resource language.
It was found that the model pre-trained on a different language but on a corpus with a huge amount of speakers performs well on samples with language mismatch.
arXiv Detail & Related papers (2022-09-26T11:49:37Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- A Highly Adaptive Acoustic Model for Accurate Multi-Dialect Speech Recognition [80.87085897419982]
We propose a novel acoustic modeling technique for accurate multi-dialect speech recognition with a single AM.
Our proposed AM is dynamically adapted based on both dialect information and its internal representation, which results in a highly adaptive AM for handling multiple dialects simultaneously.
The experimental results on large-scale speech datasets show that the proposed AM outperforms all previous ones, reducing word error rates (WERs) by 8.11% relative compared to a single all-dialects AM and by 7.31% relative compared to dialect-specific AMs.
arXiv Detail & Related papers (2022-05-06T06:07:09Z)
- Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents [0.0]
We use a transfer learning approach to develop an end-to-end speech recognition system for Indian-English accents.
Indic TTS data of Indian-English accents is used for transfer learning and fine-tuning the pre-trained Deep Speech model.
arXiv Detail & Related papers (2022-04-03T03:11:21Z)
- Automatic Dialect Density Estimation for African American English [74.44807604000967]
We explore automatic prediction of dialect density of the African American English (AAE) dialect.
Dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect.
We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
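The dialect density metric defined in this summary reduces to a simple ratio. A minimal sketch, where the `is_dialect_marker` predicate is a hypothetical stand-in for whatever feature detector the paper actually uses:

```python
def dialect_density(words, is_dialect_marker):
    """Fraction of words in an utterance flagged as carrying
    characteristics of the non-standard dialect."""
    if not words:
        return 0.0
    return sum(1 for w in words if is_dialect_marker(w)) / len(words)
```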
arXiv Detail & Related papers (2022-04-03T01:34:48Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.