AccentFold: A Journey through African Accents for Zero-Shot ASR
Adaptation to Target Accents
- URL: http://arxiv.org/abs/2402.01152v2
- Date: Mon, 5 Feb 2024 05:45:59 GMT
- Title: AccentFold: A Journey through African Accents for Zero-Shot ASR
Adaptation to Target Accents
- Authors: Abraham Toluwase Owodunni, Aditya Yadavalli, Chris Chinenye Emezue,
Tobi Olatunji, Clinton C Mbataku
- Abstract summary: We propose AccentFold, a method that exploits spatial relationships between learned accent embeddings to improve Automatic Speech Recognition (ASR).
Our exploratory analysis of speech embeddings representing 100+ African accents reveals interesting spatial accent relationships.
Our findings emphasize the potential of leveraging linguistic relationships to improve ASR adaptation to target accents.
- Score: 5.746007214645182
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite advancements in speech recognition, accented speech remains
challenging. While previous approaches have focused on modeling techniques or
creating accented speech datasets, gathering sufficient data for the multitude
of accents, particularly in the African context, remains impractical due to
their sheer diversity and associated budget constraints. To address these
challenges, we propose AccentFold, a method that exploits spatial relationships
between learned accent embeddings to improve downstream Automatic Speech
Recognition (ASR). Our exploratory analysis of speech embeddings representing
100+ African accents reveals interesting spatial accent relationships
highlighting geographic and genealogical similarities, capturing consistent
phonological and morphological regularities, all learned empirically from
speech. Furthermore, we discover accent relationships previously
uncharacterized by the Ethnologue. Through empirical evaluation, we demonstrate
the effectiveness of AccentFold by showing that, for out-of-distribution (OOD)
accents, sampling accent subsets for training based on AccentFold information
outperforms strong baselines with a relative WER improvement of 4.6%. AccentFold
presents a promising approach for improving ASR performance on accented speech,
particularly in the context of African accents, where data scarcity and budget
constraints pose significant challenges. Our findings emphasize the potential
of leveraging linguistic relationships to improve zero-shot ASR adaptation to
target accents.
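The subset-selection idea in the abstract can be illustrated with a minimal sketch. It assumes that each accent is represented by a single embedding vector and that "closeness" is measured by cosine similarity; the abstract does not specify the similarity measure, the sampling rule, or how the embeddings are computed, so the function names, the toy embeddings, and the nearest-neighbour rule below are all hypothetical. The helper `relative_wer_improvement` shows the standard arithmetic behind a relative WER figure.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_accents(target, embeddings, k):
    """Return the k accents most similar to `target` (hypothetical rule).

    `embeddings` maps accent name -> embedding vector. The target itself
    is excluded, mimicking a setting where it has no training data.
    """
    target_vec = embeddings[target]
    scored = [
        (cosine_similarity(target_vec, vec), name)
        for name, vec in embeddings.items()
        if name != target
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


def relative_wer_improvement(baseline_wer, new_wer):
    """Relative WER improvement in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Made-up 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "accent_a": [1.0, 0.0, 0.0],
    "accent_b": [0.9, 0.1, 0.0],
    "accent_c": [0.0, 1.0, 0.0],
    "accent_d": [0.7, 0.7, 0.0],
}

# Accents whose speech would be sampled to train for the unseen "accent_a".
subset = nearest_accents("accent_a", toy_embeddings, k=2)
print(subset)
```

In this toy setup the training subset for an unseen target accent is drawn from its nearest neighbours in embedding space rather than chosen at random; real accent embeddings would be far higher-dimensional and derived from speech.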
Related papers
- Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS [52.89324095217975]
Previous approaches to accent conversion mainly aimed at making non-native speech sound more native.
We develop a new AC approach that not only focuses on accent conversion but also improves the pronunciation of non-native accented speakers.
arXiv Detail & Related papers (2024-10-19T06:12:31Z)
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into native to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Improving Self-supervised Pre-training using Accent-Specific Codebooks [48.409296549372414]
We propose an accent-aware adaptation technique for self-supervised learning.
On the Mozilla Common Voice dataset, our proposed approach outperforms all other accent-adaptation approaches.
arXiv Detail & Related papers (2024-07-04T08:33:52Z)
- Transfer the linguistic representations from TTS to accent conversion
with non-parallel data [7.376032484438044]
Accent conversion aims to convert the accent of a source speech to a target accent, preserving the speaker's identity.
This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent in the source speech.
arXiv Detail & Related papers (2024-01-07T16:39:34Z)
- Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks.
arXiv Detail & Related papers (2023-10-24T16:10:58Z)
- Synthetic Cross-accent Data Augmentation for Automatic Speech
Recognition [18.154258453839066]
We improve an accent-conversion model (ACM) which transforms native US-English speech into accented pronunciation.
We include phonetic knowledge in the ACM training to provide accurate feedback about how well certain pronunciation patterns were recovered in the synthesized waveform.
We evaluate our approach on native and non-native English datasets and find that synthetically accented data helps the ASR system better understand speech from seen accents.
arXiv Detail & Related papers (2023-03-01T20:05:19Z)
- Deep Discriminative Feature Learning for Accent Recognition [14.024346215923972]
We adopt Convolutional Recurrent Neural Network as front-end encoder and integrate local features using Recurrent Neural Network to make an utterance-level accent representation.
We show that our proposed network with discriminative training method is significantly ahead of the baseline system on the accent classification track in the Accented English Speech Recognition Challenge 2020.
arXiv Detail & Related papers (2020-11-25T00:46:47Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Black-box Adaptation of ASR for Accented Speech [52.63060669715216]
We introduce the problem of adapting a black-box, cloud-based ASR system to speech from a target accent.
We propose a novel coupling of an open-source accent-tuned local model with the black-box service.
Our fine-grained merging algorithm is better at fixing accent errors than existing word-level combination strategies.
arXiv Detail & Related papers (2020-06-24T07:07:49Z)
- AccentDB: A Database of Non-Native English Accents to Assist Neural
Speech Recognition [3.028098724882708]
We first spell out the key requirements for creating a well-curated database of speech samples in non-native accents for training and testing robust ASR systems.
We then introduce AccentDB, one such database that contains samples of 4 Indian-English accents collected by us.
We present several accent classification models and evaluate them thoroughly against human-labelled accent classes.
arXiv Detail & Related papers (2020-05-16T12:38:30Z)