Improving Speech Recognition for African American English With Audio Classification
- URL: http://arxiv.org/abs/2309.09996v1
- Date: Sat, 16 Sep 2023 19:57:45 GMT
- Title: Improving Speech Recognition for African American English With Audio Classification
- Authors: Shefali Garg, Zhouyuan Huo, Khe Chai Sim, Suzan Schwartz, Mason Chua, Alëna Aksënova, Tsendsuren Munkhdalai, Levi King, Darryl Wright, Zion Mengesha, Dongseong Hwang, Tara Sainath, Françoise Beaufays, Pedro Moreno Mengibar
- Abstract summary: We propose a new way to improve the robustness of a US English short-form speech recognizer using a small amount of out-of-domain data.
Fine-tuning on this data results in a 38.5% relative word error rate disparity reduction between AAE and MAE without reducing MAE quality.
- Score: 17.785482810741367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech recognition (ASR) systems have been shown to have large
quality disparities between the language varieties they are intended or
expected to recognize. One way to mitigate this is to train or fine-tune models
with more representative datasets. But this approach can be hindered by limited
in-domain data for training and evaluation. We propose a new way to improve the
robustness of a US English short-form speech recognizer using a small amount of
out-of-domain (long-form) African American English (AAE) data. We use CORAAL,
YouTube and Mozilla Common Voice to train an audio classifier to approximately
output whether an utterance is AAE or some other variety including Mainstream
American English (MAE). By combining the classifier output with coarse
geographic information, we can select a subset of utterances from a large
corpus of untranscribed short-form queries for semi-supervised learning at
scale. Fine-tuning on this data results in a 38.5% relative word error rate
disparity reduction between AAE and MAE without reducing MAE quality.
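The selection recipe described above reduces to gating a large untranscribed corpus on two signals: the audio classifier's AAE score and a coarse geographic prior. Below is a minimal sketch of that step, assuming a classifier that returns P(AAE | audio); the threshold, record fields, and function names are illustrative, since the paper does not publish this pipeline as code.

```python
# Minimal sketch of the data-selection step described in the abstract.
# The classifier interface, the 0.8 threshold, and the record fields
# ("audio", "region") are assumptions, not the paper's published API.

def select_aae_utterances(records, classifier, geo_regions, threshold=0.8):
    """Pick untranscribed short-form utterances that are likely AAE.

    records     : iterable of dicts with "audio" and "region" keys (assumed)
    classifier  : callable returning P(AAE | audio) in [0, 1] (assumed)
    geo_regions : coarse geographic regions used as an additional prior
    """
    selected = []
    for rec in records:
        p_aae = classifier(rec["audio"])  # audio-classifier score
        if p_aae >= threshold and rec["region"] in geo_regions:
            selected.append(rec)  # kept for semi-supervised fine-tuning
    return selected
```

Read this way, the reported 38.5% relative disparity reduction means the AAE-MAE WER gap after fine-tuning is 61.5% of its pre-fine-tuning size: (gap_before - gap_after) / gap_before = 0.385.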
Related papers
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A predicts human judgements of quality better than traditional metrics do.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
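As a rough illustration of the LLM-as-judge idea behind CLAIR-A, the sketch below asks a language model for a numeric quality score plus a short justification. The prompt wording and the query_llm helper are hypothetical stand-ins for any chat-completion API, not CLAIR-A's published implementation.

```python
import json

# Hypothetical LLM-as-judge scoring in the spirit of CLAIR-A; the prompt
# text and query_llm() are illustrative assumptions, not the paper's code.
PROMPT = """You are judging a machine-generated audio caption.
Reference caption: {ref}
Candidate caption: {cand}
Reply with JSON only: {{"score": <integer 0-100>, "reason": "<one sentence>"}}"""

def llm_judge_score(candidate: str, reference: str, query_llm) -> dict:
    """query_llm: callable str -> str wrapping any chat-completion API."""
    reply = query_llm(PROMPT.format(ref=reference, cand=candidate))
    return json.loads(reply)  # e.g. {"score": 87, "reason": "..."}
```

Asking for a reason alongside the score is part of what makes such judges more interpretable than n-gram metrics.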
- Self-supervised Speech Representations Still Struggle with African American Vernacular English [28.223877889211803]
Underperformance of ASR systems for speakers of marginalized language varieties is a well-documented phenomenon.
We investigate whether the recent wave of Self-Supervised Learning speech models can close the gap in ASR performance between AAVE and Mainstream American English.
arXiv Detail & Related papers (2024-08-26T13:29:25Z)
- Improving Accented Speech Recognition with Multi-Domain Training [2.28438857884398]
We use speech audio representing four different French accents to create fine-tuning datasets that improve the robustness of pre-trained ASR models.
Our numerical experiments show that we can reduce error rates by up to 25% (relative) on African and Belgian accents.
arXiv Detail & Related papers (2023-03-14T14:10:16Z)
- Synthetic Cross-accent Data Augmentation for Automatic Speech Recognition [18.154258453839066]
We improve an accent-conversion model (ACM) which transforms native US-English speech into accented pronunciation.
We include phonetic knowledge in the ACM training to provide accurate feedback about how well certain pronunciation patterns were recovered in the synthesized waveform.
We evaluate our approach on native and non-native English datasets and find that synthetically accented data helps the ASR better understand speech from seen accents.
arXiv Detail & Related papers (2023-03-01T20:05:19Z)
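To make the augmentation recipe in the entry above concrete, here is a hedged sketch: a stand-in accent_convert function plays the role of the accent-conversion model, producing accented copies of native utterances that keep their original transcripts.

```python
# Sketch of synthetic cross-accent augmentation; accent_convert() is a
# hypothetical stand-in for the paper's accent-conversion model (ACM).
def augment_with_accents(dataset, accent_convert, accents):
    """dataset: list of (waveform, transcript) pairs."""
    augmented = list(dataset)
    for waveform, transcript in dataset:
        for accent in accents:
            # The transcript is unchanged: only the pronunciation shifts.
            augmented.append((accent_convert(waveform, accent), transcript))
    return augmented
```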
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
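SLICER's exact objective is not reproduced here; as a generic point of reference, the sketch below shows an InfoNCE-style contrastive loss of the kind such self-supervised audio pre-training commonly builds on, with paired views of the same clips as positives.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Generic contrastive loss; z1, z2 are (batch, dim) embeddings of
    two views of the same audio clips. Illustrative, not SLICER's loss."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(z1.size(0))   # matching pairs on the diagonal
    return F.cross_entropy(logits, targets)
```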
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
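For concreteness, the relative improvement implied by the numbers in the entry above can be checked directly:

```python
# 14.55% -> 10.36% absolute corresponds to roughly a 28.8% relative cut.
baseline_wer, adapted_wer = 14.55, 10.36
relative_reduction = (baseline_wer - adapted_wer) / baseline_wer
print(f"{relative_reduction:.1%}")  # 28.8%
```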
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM that operates on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
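As a rough sketch of what an LSTM language model over discrete sub-word units looks like, the snippet below predicts the next phoneme ID from the sequence so far; the vocabulary size and dimensions are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PhonemeLM(nn.Module):
    """Toy next-phoneme LSTM LM; sizes are illustrative assumptions."""
    def __init__(self, n_units=64, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_units, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_units)

    def forward(self, ids):            # ids: (batch, time) phoneme IDs
        h, _ = self.lstm(self.embed(ids))
        return self.head(h)            # (batch, time, n_units) logits

# Trained with cross-entropy against the input shifted one step left.
```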
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native automatic speech scoring (ASS), called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed it as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
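A minimal sketch of the speaker-conditioning idea as we read it: embeddings of a candidate's other responses are pooled into a context vector and appended to the representation of the response being scored. The mean-pooling choice and dimensions are our assumptions.

```python
import torch

def speaker_conditioned_input(response_emb, other_response_embs):
    """response_emb: (d,) embedding of the response being scored;
    other_response_embs: (n, d) embeddings of the same candidate's
    other responses. Mean-pooling is an illustrative choice."""
    speaker_context = other_response_embs.mean(dim=0)   # (d,)
    return torch.cat([response_emb, speaker_context])   # (2d,) scorer input
```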
- English Accent Accuracy Analysis in a State-of-the-Art Automatic Speech Recognition System [3.4888132404740797]
We evaluate a state-of-the-art automatic speech recognition model using unseen data from a corpus with a wide variety of labeled English accents.
We show that there is indeed an accuracy bias in terms of accentual variety, favoring the accents most prevalent in the training corpus.
arXiv Detail & Related papers (2021-05-09T08:24:33Z)
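A per-accent WER breakdown of the kind such an evaluation requires might look like the sketch below; the jiwer library and the record fields are our choices, not the paper's setup.

```python
from collections import defaultdict
from jiwer import wer  # pip install jiwer

def wer_by_accent(samples):
    """samples: iterable of dicts with "accent", "reference",
    and "hypothesis" keys (illustrative field names)."""
    groups = defaultdict(lambda: ([], []))
    for s in samples:
        refs, hyps = groups[s["accent"]]
        refs.append(s["reference"])
        hyps.append(s["hypothesis"])
    return {accent: wer(refs, hyps) for accent, (refs, hyps) in groups.items()}
```

Comparing these per-accent scores against the training-data accent mix is what exposes the bias the entry describes.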
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer achieved competitive results with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
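BPE-dropout makes acoustic-unit segmentation stochastic by skipping individual BPE merges at random during encoding, so the same word can be split differently across epochs. A toy illustration with a made-up merge table:

```python
import random

def bpe_dropout_encode(word, merges, p=0.1):
    """Toy BPE-dropout: each applicable merge is skipped with
    probability p. merges: list of (left, right) pairs in priority order."""
    symbols = list(word)
    ranks = {pair: i for i, pair in enumerate(merges)}
    while True:
        # Highest-priority adjacent pair that survives this round's dropout.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and random.random() >= p
        ]
        if not candidates:
            return symbols
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]

# With p=0 this reduces to ordinary deterministic BPE encoding, e.g.
# bpe_dropout_encode("lower", [("l","o"), ("lo","w"), ("e","r")], p=0.0)
```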