Computer-assisted Pronunciation Training -- Speech synthesis is almost
all you need
- URL: http://arxiv.org/abs/2207.00774v1
- Date: Sat, 2 Jul 2022 08:33:33 GMT
- Title: Computer-assisted Pronunciation Training -- Speech synthesis is almost
all you need
- Authors: Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Bozena Kostek
- Abstract summary: Existing CAPT methods are not able to detect pronunciation errors with high accuracy.
We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion.
We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state-of-the-art in the field.
- Score: 18.446969150062586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The research community has long studied computer-assisted pronunciation
training (CAPT) methods in non-native speech. Researchers have focused on
various model architectures, such as Bayesian networks and deep learning
methods, as well as on the analysis of different representations of the speech
signal. Despite significant progress in recent years, existing CAPT methods are
not able to detect pronunciation errors with high accuracy (only 60% precision
at 40%-80% recall). One of the key problems is the low availability of
mispronounced speech that is needed for the reliable training of pronunciation
error detection models. If we had a generative model that could mimic
non-native speech and produce any amount of training data, then the task of
detecting pronunciation errors would be much easier. We present three
innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S),
and speech-to-speech (S2S) conversion to generate correctly pronounced and
mispronounced synthetic speech. We show that these techniques not only improve
the accuracy of three machine learning models for detecting pronunciation
errors but also help establish a new state-of-the-art in the field. Earlier
studies have used simple speech generation techniques such as P2P conversion,
but only as an additional mechanism to improve the accuracy of pronunciation
error detection. We, on the other hand, treat speech generation as a
first-class method for detecting pronunciation errors. The effectiveness of
these techniques is assessed in the tasks of detecting pronunciation and
lexical stress errors. Non-native English speech corpora of German, Italian,
and Polish speakers are used in the evaluations. The best proposed S2S
technique improves the accuracy of detecting pronunciation errors in the AUC
metric by 41%, from 0.528 to 0.749, compared to the state-of-the-art approach.
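To make the P2P idea concrete, here is a minimal sketch of how mispronounced training data could be generated by perturbing a canonical phoneme sequence. The function, the error probability, and the toy phoneme inventory are illustrative assumptions, not the paper's exact recipe.

```python
import random

def inject_substitution_errors(phonemes, inventory, error_prob=0.15, seed=None):
    """Hypothetical P2P-style error injection: randomly substitute phonemes
    to mimic mispronounced speech. Returns the perturbed sequence and
    per-phoneme error labels (1 = mispronounced) for detector training.
    Deletions and insertions could be simulated the same way."""
    rng = random.Random(seed)
    perturbed, labels = [], []
    for ph in phonemes:
        if rng.random() < error_prob:
            # substitute with a different phoneme from the inventory
            perturbed.append(rng.choice([q for q in inventory if q != ph]))
            labels.append(1)
        else:
            perturbed.append(ph)
            labels.append(0)
    return perturbed, labels

# Example: simulate a learner mispronouncing "think" (/TH IH NG K/ in ARPAbet)
inventory = ["T", "TH", "D", "DH", "S", "Z", "IH", "IY", "NG", "N", "K", "G"]
print(inject_substitution_errors(["TH", "IH", "NG", "K"], inventory, error_prob=0.5, seed=7))
```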
Related papers
- Phonological Level wav2vec2-based Mispronunciation Detection and
Diagnosis Method [11.069975459609829]
We propose a low-level Mispronunciation Detection and Diagnosis (MDD) approach based on the detection of speech attribute features.
The proposed method was applied to L2 speech corpora collected from English learners with different native languages.
arXiv Detail & Related papers (2023-11-13T02:41:41Z)
- Controllable Emphasis with zero data for text-to-speech [57.12383531339368]
A simple but effective method to achieve emphasized speech is to increase the predicted duration of the emphasised word (sketched below).
We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and testers' correct identification of the emphasised word in a sentence by 40% on a reference female en-US voice.
arXiv Detail & Related papers (2023-07-13T21:06:23Z)
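The duration-based emphasis idea above is simple enough to sketch directly: scale the durations that a TTS duration predictor assigns to the emphasised word's phonemes. The function name, frame representation, and scaling factor below are assumptions for illustration.

```python
import numpy as np

def emphasize_word(durations, word_spans, word_idx, scale=1.4):
    """Hypothetical duration-based emphasis: lengthen the per-phoneme
    durations (in frames) predicted by a TTS duration model for one word.

    durations  -- per-phoneme durations, shape (num_phonemes,)
    word_spans -- (start, end) phoneme index ranges, one per word
    word_idx   -- which word to emphasise
    scale      -- assumed lengthening factor (>1 slows the word down)
    """
    out = np.asarray(durations, dtype=float).copy()
    start, end = word_spans[word_idx]
    out[start:end] = np.round(out[start:end] * scale)
    return out

# Example: emphasise the second word of a three-word utterance
durs = [5, 7, 6, 9, 8, 4, 6]           # frames per phoneme
spans = [(0, 2), (2, 5), (5, 7)]       # phoneme ranges of each word
print(emphasize_word(durs, spans, word_idx=1))
```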
- DTW-SiameseNet: Dynamic Time Warped Siamese Network for Mispronunciation Detection and Correction [1.8322859214908722]
We present a highly precise, PDA-compatible pronunciation learning framework for the task of TTS mispronunciation detection and correction.
We also propose a novel mispronunciation detection model called DTW-SiameseNet, which employs metric learning with a Siamese architecture for Dynamic Time Warping (DTW) with triplet loss.
Human evaluation shows our proposed approach improves pronunciation accuracy on average by 6% compared to strong phoneme-based and audio-based baselines.
arXiv Detail & Related papers (2023-03-01T01:53:11Z)
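The combination of DTW and triplet loss described above can be illustrated with a minimal sketch: a classic DTW distance over feature sequences plugged into the standard triplet hinge. It assumes plain Euclidean frame distances rather than the paper's trained Siamese encoder.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between two feature sequences
    a (m, d) and b (n, d) using Euclidean frame distances."""
    m, n = len(a), len(b)
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[m, n]

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet hinge on DTW distances: pull matching pronunciations
    together, push mismatched ones at least `margin` further apart."""
    return max(0.0, dtw_distance(anchor, positive) - dtw_distance(anchor, negative) + margin)

# Example with random 2-D feature sequences of unequal length
rng = np.random.default_rng(0)
a, p, n = rng.normal(size=(10, 2)), rng.normal(size=(12, 2)), rng.normal(size=(8, 2)) + 3
print(triplet_loss(a, p, n))
```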
- Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models [14.887781621924255]
This paper is the first attempt to extend the use of pre-trained models into word-level zero-resource speech recognition.
It is done by fine-tuning the pre-trained models on IPA phoneme transcriptions and decoding with a language model trained on extra texts.
Experiments on Wav2vec 2.0 and HuBERT models show that this method can achieve less than 20% word error rate on some languages.
arXiv Detail & Related papers (2022-10-13T12:11:18Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
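A common way to induce such a pseudo language, assumed here for illustration (the paper's exact pipeline may differ), is to quantize frame-level acoustic features with k-means and collapse consecutive repeats:

```python
import numpy as np
from itertools import groupby
from sklearn.cluster import KMeans

def induce_pseudo_language(features, num_units=50, seed=0):
    """Hypothetical pseudo-language induction: quantize frame features
    (num_frames, dim) into discrete unit IDs with k-means, then collapse
    consecutive repeats into a compact token sequence."""
    units = KMeans(n_clusters=num_units, n_init=10, random_state=seed).fit_predict(features)
    return [int(u) for u, _ in groupby(units)]

# Example: 200 random "frames" of 16-dim features
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))
tokens = induce_pseudo_language(feats, num_units=8)
print(tokens[:20])  # a discrete "pseudo text" over 8 units
```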
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
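The target-switching idea can be sketched in a few lines: each view's context vectors must identify the other view's quantized vector at the same timestep among in-utterance distractors. The cosine-softmax loss below is a generic InfoNCE stand-in for the full wav2vec 2.0 objective, not the paper's implementation.

```python
import numpy as np

def contrastive_loss(context, targets, temperature=0.1):
    """InfoNCE-style loss: each context vector (T, D) must identify its
    own timestep's target among all T targets via cosine similarity."""
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    q = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = c @ q.T / temperature                      # (T, T) similarity matrix
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))                      # correct target is the diagonal

def switched_loss(c_clean, c_noisy, q_clean, q_noisy):
    """wav2vec-Switch-style swap (sketch): clean context predicts the noisy
    quantization and vice versa, so representations must ignore the noise."""
    return contrastive_loss(c_clean, q_noisy) + contrastive_loss(c_noisy, q_clean)

# Toy example with T=6 timesteps, D=4 dims
rng = np.random.default_rng(1)
c1, c2, q1, q2 = (rng.normal(size=(6, 4)) for _ in range(4))
print(switched_loss(c1, c2, q1, q2))
```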
English speech [14.430965595136149]
We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech.
Phonetically transcribed L2 speech is not required and we only need to mark mispronounced words.
Compared to state-of-the-art approaches, we improve the accuracy of detecting word-level pronunciation errors in the AUC metric by 30% on the GUT Isle Corpus of L2 Polish speakers, and by 21.5% on the Isle Corpus of L2 German and Italian speakers.
arXiv Detail & Related papers (2021-06-07T10:31:53Z)
- Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
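BPE-dropout itself is straightforward to sketch: apply the learned merge table as usual, but skip each candidate merge with probability p, so the same word receives different segmentations across training epochs. The merge table below is a toy assumption.

```python
import random

def bpe_dropout_encode(word, merges, p=0.1, seed=None):
    """BPE encoding with dropout: greedily apply the highest-priority
    merge present in the token sequence, but skip each candidate merge
    with probability p to produce varied segmentations."""
    rng = random.Random(seed)
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while True:
        # candidate merges present in the current sequence, best rank first
        pairs = [(rank[(a, b)], i) for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
                 if (a, b) in rank and rng.random() >= p]
        if not pairs:
            return tokens
        _, i = min(pairs)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

# Toy merge table; real tables are learned from a corpus
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("w", "er")]
for s in range(3):
    print(bpe_dropout_encode("lower", merges, p=0.3, seed=s))
```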
- Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling [13.451106880540326]
A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker.
We propose a novel approach to overcome this problem based on two principles.
We evaluate the model on non-native (L2) English speech of German, Italian and Polish speakers, where it is shown to increase the precision of detecting mispronunciations by up to 18%.
arXiv Detail & Related papers (2021-01-16T08:03:51Z)
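For context, the conventional compare-to-canonical baseline that this paper improves upon can be sketched as an edit-distance alignment between recognized and expected phonemes, flagging any deviation. This illustrates the baseline described in the summary above, not the paper's uncertainty model.

```python
def align_and_flag(recognized, expected):
    """Edit-distance alignment of recognized vs. canonical phonemes:
    mismatches, insertions, and deletions become mispronunciation
    candidates (the conventional baseline, not the paper's method)."""
    m, n = len(recognized), len(expected)
    # DP table of edit distances
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (recognized[i - 1] != expected[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # backtrace, collecting positions where the learner deviates
    flags, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (recognized[i - 1] != expected[j - 1]):
            if recognized[i - 1] != expected[j - 1]:
                flags.append(("substitution", expected[j - 1], recognized[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            flags.append(("insertion", None, recognized[i - 1]))
            i -= 1
        else:
            flags.append(("deletion", expected[j - 1], None))
            j -= 1
    return list(reversed(flags))

# Example: learner says /T IH NG K/ for canonical /TH IH NG K/ ("think")
print(align_and_flag(["T", "IH", "NG", "K"], ["TH", "IH", "NG", "K"]))
```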