A Highly Adaptive Acoustic Model for Accurate Multi-Dialect Speech
Recognition
- URL: http://arxiv.org/abs/2205.03027v1
- Date: Fri, 6 May 2022 06:07:09 GMT
- Authors: Sanghyun Yoo, Inchul Song, Yoshua Bengio
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite the success of deep learning in speech recognition, multi-dialect
speech recognition remains a difficult problem. Although dialect-specific
acoustic models are known to perform well in general, they are not easy to
maintain when dialect-specific data is scarce and the number of dialects for
each language is large. Therefore, a single unified acoustic model (AM) that
generalizes well for many dialects has been in demand. In this paper, we
propose a novel acoustic modeling technique for accurate multi-dialect speech
recognition with a single AM. Our proposed AM is dynamically adapted based on
both dialect information and its internal representation, which results in a
highly adaptive AM for handling multiple dialects simultaneously. We also
propose a simple but effective training method to deal with unseen dialects.
The experimental results on large-scale speech datasets show that the proposed
AM outperforms all the previous ones, reducing word error rates (WERs) by 8.11%
relative compared to a single all-dialects AM and by 7.31% relative compared to
dialect-specific AMs.
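The abstract describes an acoustic model whose computation is modulated by dialect information and its internal representation, but does not spell out the mechanism. As a minimal illustrative sketch (not the paper's actual method), one common way to condition a shared layer on a dialect embedding is FiLM-style scaling and shifting; all function and parameter names below are hypothetical:

```python
import numpy as np

def dialect_adapted_layer(x, W, b, dialect_emb, U_gamma, U_beta):
    """One hidden layer whose activations are scaled and shifted by
    parameters predicted from a dialect embedding (FiLM-style conditioning).

    x              : (batch, d_in) acoustic features
    W, b           : base layer weights/bias, shared across all dialects
    dialect_emb    : (batch, d_dial) dialect representation
    U_gamma, U_beta: projections from the dialect embedding to
                     per-unit scale and shift parameters
    """
    h = np.tanh(x @ W + b)              # shared transform
    gamma = dialect_emb @ U_gamma       # dialect-specific scale
    beta = dialect_emb @ U_beta         # dialect-specific shift
    return (1.0 + gamma) * h + beta     # adapted activations

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))            # batch of 4 feature vectors
W, b = rng.normal(size=(16, 8)), np.zeros(8)
emb = rng.normal(size=(4, 3))           # per-utterance dialect embeddings
U_g, U_b = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))
out = dialect_adapted_layer(x, W, b, emb, U_g, U_b)
print(out.shape)  # (4, 8)
```

Because the scale and shift are computed per utterance, a single set of shared weights can behave differently for each dialect, which is the general idea behind adapting one AM to many dialects at once.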
Related papers
- Disentangling Voice and Content with Self-Supervision for Speaker
Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- DADA: Dialect Adaptation via Dynamic Aggregation of Linguistic Rules [64.93179829965072]
DADA is a modular approach to imbue SAE-trained models with multi-dialectal robustness.
We show that DADA is effective for both single-task and instruction fine-tuned language models.
arXiv Detail & Related papers (2023-05-22T18:43:31Z)
- Pre-Finetuning for Few-Shot Emotional Speech Recognition [61.463533069294414]
We view speaker adaptation as a few-shot learning problem.
We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives.
arXiv Detail & Related papers (2023-02-24T22:38:54Z)
- End-to-End Automatic Speech Recognition model for the Sudanese Dialect [0.0]
This paper examines the viability of designing an Automatic Speech Recognition model for the Sudanese dialect.
The paper gives an overview of the Sudanese dialect, the collection of representative resources, and the pre-processing performed to construct a modest dataset.
The designed model provided some insights into the current recognition task and reached an average Label Error Rate of 73.67%.
arXiv Detail & Related papers (2022-12-21T07:35:33Z)
- Quantifying Language Variation Acoustically with Few Resources [4.162663632560141]
Deep acoustic models might have learned linguistic information that transfers to low-resource languages.
We compute pairwise pronunciation differences averaged over 10 words for over 100 individual dialects from four (regional) languages.
Our results show that acoustic models outperform the (traditional) transcription-based approach without requiring phonetic transcriptions.
arXiv Detail & Related papers (2022-05-05T15:00:56Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- English Accent Accuracy Analysis in a State-of-the-Art Automatic Speech Recognition System [3.4888132404740797]
We evaluate a state-of-the-art automatic speech recognition model, using unseen data from a corpus with a wide variety of labeled English accents.
We show that there is indeed an accuracy bias in terms of accentual variety, favoring the accents most prevalent in the training corpus.
arXiv Detail & Related papers (2021-05-09T08:24:33Z)
- Learning to Recognize Dialect Features [21.277962038423123]
We introduce the task of dialect feature detection, and present two multitask learning approaches.
We train our models on a small number of minimal pairs, building on how linguists typically define dialect features.
arXiv Detail & Related papers (2020-10-23T23:25:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.