A transfer learning based approach for pronunciation scoring
- URL: http://arxiv.org/abs/2111.00976v2
- Date: Tue, 9 May 2023 16:43:19 GMT
- Title: A transfer learning based approach for pronunciation scoring
- Authors: Marcelo Sancinetti, Jazmin Vidal, Cyntia Bonomi, Luciana Ferrer
- Abstract summary: Phone-level pronunciation scoring is a challenging task, with performance far from that of human annotators.
Standard systems generate a score for each phone in a phrase using models trained for automatic speech recognition (ASR) with native data only.
We present a transfer learning-based approach that leverages a model trained for ASR, adapting it for the task of pronunciation scoring.
- Score: 7.98890440106366
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Phone-level pronunciation scoring is a challenging task, with performance far
from that of human annotators. Standard systems generate a score for each phone
in a phrase using models trained for automatic speech recognition (ASR) with
native data only. Better performance has been shown when using systems that are
trained specifically for the task using non-native data. Yet, such systems face
the challenge that datasets labelled for this task are scarce and usually
small. In this paper, we present a transfer learning-based approach that
leverages a model trained for ASR, adapting it for the task of pronunciation
scoring. We analyze the effect of several design choices and compare the
performance with a state-of-the-art goodness of pronunciation (GOP) system. Our
final system is 20% better than the GOP system on EpaDB, a database for
pronunciation scoring research, for a cost function that prioritizes low rates
of unnecessary corrections.
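The GOP baseline mentioned in the abstract typically scores a phone by the average log posterior that an ASR acoustic model assigns to the canonical phone over its aligned frames. A minimal sketch of that idea follows; the function name and the toy posterior values are illustrative and not taken from the paper:

```python
import math

def gop_score(frame_posteriors, canonical_phone):
    """One common Goodness of Pronunciation formulation: the mean log
    posterior of the canonical phone over the frames aligned to it.
    Scores near 0 suggest good pronunciation; large negative scores
    suggest a likely mispronunciation."""
    logs = [math.log(frames[canonical_phone]) for frames in frame_posteriors]
    return sum(logs) / len(logs)

# Toy example: three frames aligned to the phone "ae", with per-frame
# phone posteriors as an ASR model might produce them.
frames = [
    {"ae": 0.7, "eh": 0.3},
    {"ae": 0.6, "eh": 0.4},
    {"ae": 0.8, "eh": 0.2},
]
score = gop_score(frames, "ae")
```

In practice the score is thresholded per phone to decide whether to flag a correction; the cost function mentioned above trades false corrections against missed ones when choosing that threshold.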
Related papers
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z) - Influence Scores at Scale for Efficient Language Data Sampling [3.072340427031969]
"Influence scores" are used to identify important subsets of data.
In this paper, we explore the applicability of influence scores in language classification tasks.
arXiv Detail & Related papers (2023-11-27T20:19:22Z) - Mispronunciation detection using self-supervised speech representations [10.010024759851142]
We study the use of SSL models for the task of mispronunciation detection for second language learners.
We compare two downstream approaches: 1) training the model for phone recognition using native English data, and 2) training a model directly for the target task using non-native English data.
arXiv Detail & Related papers (2023-07-30T21:20:58Z) - Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z) - LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z) - A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z) - Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z) - Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.