Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition
- URL: http://arxiv.org/abs/2408.02582v1
- Date: Mon, 5 Aug 2024 16:00:07 GMT
- Title: Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition
- Authors: Jaeyoung Kim, Han Lu, Soheil Khorram, Anshuman Tripathi, Qian Zhang, Hasim Sak
- Abstract summary: We present accent clustering and mining schemes for fair speech recognition systems.
For accent recognition, we applied three schemes to overcome the limited size of supervised accent data.
Fine-tuning ASR on mined Indian-accented speech showed 10.0% and 5.3% relative improvements compared to fine-tuning on randomly sampled speech.
- Score: 18.90193320368228
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern automatic speech recognition (ASR) systems are typically trained on tens of thousands of hours of speech data, which is one of the main factors behind their great success. However, the distribution of such data is typically biased towards common accents or typical speech patterns. As a result, those systems often perform poorly on atypical accented speech. In this paper, we present accent clustering and mining schemes for fair speech recognition systems which can perform equally well on under-represented accented speech. For accent recognition, we applied three schemes to overcome the limited size of supervised accent data: supervised or unsupervised pre-training, distributionally robust optimization (DRO), and unsupervised clustering. All three schemes significantly improve the accent recognition model, especially for unbalanced and small accented-speech datasets. Fine-tuning ASR on Indian-accented speech mined with the proposed supervised and unsupervised clustering schemes showed 10.0% and 5.3% relative improvements, respectively, compared to fine-tuning on randomly sampled speech.
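The abstract does not give implementation details, but the clustering-and-mining step it describes can be illustrated with a minimal toy sketch: cluster utterance-level embeddings (here with a tiny pure-Python k-means), then "mine" the utterances that land in the same cluster as a seed utterance known to carry the target accent. All names, the 2-D embeddings, and the use of k-means are illustrative assumptions, not the paper's actual method.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: returns (centroids, cluster assignment per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep old centroid if the cluster went empty
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return centroids, assign

# Toy 2-D "utterance embeddings": two well-separated accent groups.
group_a = [[0.0 + i * 0.1, 0.0] for i in range(5)]   # majority accent
group_b = [[5.0 + i * 0.1, 5.0] for i in range(5)]   # target accent
embeddings = group_a + group_b

_, assign = kmeans(embeddings, k=2)

# Mine utterances in the same cluster as a seed utterance (index 5)
# known to carry the target accent; these would feed ASR fine-tuning.
seed_cluster = assign[5]
mined = [i for i, a in enumerate(assign) if a == seed_cluster]
```

In a real system the embeddings would come from an accent-recognition model (trained with the pre-training, DRO, or unsupervised-clustering schemes the abstract lists), and the mined subset would be used to fine-tune the ASR model.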
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into native to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z) - Improving Self-supervised Pre-training using Accent-Specific Codebooks [48.409296549372414]
The paper proposes an accent-aware adaptation technique for self-supervised learning.
On the Mozilla Common Voice dataset, our proposed approach outperforms all other accent-adaptation approaches.
arXiv Detail & Related papers (2024-07-04T08:33:52Z) - Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z) - Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks.
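The mechanism named in this summary, cross-attention over a trainable set of codebook entries, can be sketched in miniature: a query vector attends over codebook vectors via scaled dot-product attention and returns their weighted sum. This is a generic single-head sketch, not the paper's architecture; all names and values are hypothetical.

```python
import math

def cross_attend(query, codebook):
    """Scaled dot-product cross-attention of one query over codebook entries."""
    scores = [sum(q * k for q, k in zip(query, entry)) / math.sqrt(len(query))
              for entry in codebook]
    # numerically stable softmax over the scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # context vector: attention-weighted sum of codebook entries
    dim = len(codebook[0])
    return [sum(w * entry[d] for w, entry in zip(weights, codebook))
            for d in range(dim)]

# A query aligned with the first of two orthonormal codebook entries
# attends mostly (but softly) to that entry.
ctx = cross_attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In the paper's setting the codebook entries would be trainable accent-specific vectors and the queries would be encoder states of the end-to-end ASR model.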
arXiv Detail & Related papers (2023-10-24T16:10:58Z) - Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at tens of thousands of samples per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
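The de-duplication method mentioned above can be sketched as a simple run-length collapse of consecutive repeated discrete units; the function name and example token IDs are illustrative.

```python
def deduplicate(units):
    """Collapse consecutive repeated discrete units to shorten the sequence."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

# A sequence of discrete speech units with repeats, as produced by
# frame-level quantization of self-supervised representations.
compressed = deduplicate([7, 7, 7, 3, 3, 9, 7, 7])
```

Subword modeling (e.g. BPE over the de-duplicated unit stream) would then compress the sequence further by merging frequent unit n-grams.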
arXiv Detail & Related papers (2023-09-27T17:21:13Z) - Improving Fairness and Robustness in End-to-End Speech Recognition through unsupervised clustering [49.069298478971696]
We present a privacy preserving approach to improve fairness and robustness of end-to-end ASR.
We extract utterance level embeddings using a speaker ID model trained on a public dataset.
We use cluster IDs instead of speaker utterance embeddings as extra features during model training.
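The privacy-preserving substitution described in this summary, feeding a discrete cluster ID rather than the raw speaker embedding, can be sketched as follows: assign each utterance embedding to its nearest cluster centroid and emit a one-hot cluster-ID feature. The centroids and helper names are hypothetical; a real system would derive both from a speaker-ID model trained on a public dataset.

```python
def nearest_cluster(embedding, centroids):
    """Index of the centroid nearest to an utterance embedding."""
    dists = [sum((x - y) ** 2 for x, y in zip(embedding, c))
             for c in centroids]
    return dists.index(min(dists))

def cluster_id_feature(embedding, centroids):
    """One-hot cluster-ID vector used in place of the raw speaker embedding."""
    one_hot = [0.0] * len(centroids)
    one_hot[nearest_cluster(embedding, centroids)] = 1.0
    return one_hot

# Two hypothetical cluster centroids; an utterance near the second one
# yields the discrete feature [0.0, 1.0] instead of its raw embedding.
centroids = [[0.0, 0.0], [5.0, 5.0]]
feat = cluster_id_feature([4.8, 5.2], centroids)
```

Because only the cluster index reaches the ASR model, no individual speaker embedding is exposed during training, which is the privacy-preserving property the summary highlights.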
arXiv Detail & Related papers (2023-06-06T21:13:08Z) - CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice [1.559929646151698]
We introduce a recipe aligned to the SpeechBrain toolkit for accent classification based on Common Voice 7.0 (English) and Common Voice 11.0 (Italian, German, and Spanish).
We establish a new state of the art for English accent classification, with accuracy as high as 95%.
arXiv Detail & Related papers (2023-05-29T17:53:35Z) - Comparing Supervised Models And Learned Speech Representations For Classifying Intelligibility Of Disordered Speech On Selected Phrases [11.3463024120429]
We develop and compare different deep learning techniques to classify the intelligibility of disordered speech on selected phrases.
We collected samples from a diverse set of 661 speakers with a variety of self-reported disorders, each speaking 29 words or phrases.
arXiv Detail & Related papers (2021-07-08T17:24:25Z) - Accented Speech Recognition: A Survey [0.0]
We present a survey of current promising approaches to accented speech recognition.
The resulting bias in ASR performance across accents comes at a cost to both users and providers of ASR.
arXiv Detail & Related papers (2021-04-21T20:21:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.