Text-Aware End-to-end Mispronunciation Detection and Diagnosis
- URL: http://arxiv.org/abs/2206.07289v1
- Date: Wed, 15 Jun 2022 04:08:10 GMT
- Title: Text-Aware End-to-end Mispronunciation Detection and Diagnosis
- Authors: Linkai Peng, Yingming Gao, Binghuai Lin, Dengfeng Ke, Yanlu Xie,
Jinsong Zhang
- Abstract summary: Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems.
In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information.
- Score: 17.286013739453796
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Mispronunciation detection and diagnosis (MDD) technology is a key component
of computer-assisted pronunciation training (CAPT) systems. When assessing the
pronunciation quality of constrained speech, the given transcriptions can play
the role of a teacher. Conventional methods have made full use of these prior
texts for model construction or for improving system performance, e.g., via
forced alignment and extended recognition networks. Recently, some end-to-end
methods have attempted to incorporate the prior texts into model training and
have shown preliminary effectiveness. However, previous studies mostly apply a
raw attention mechanism to fuse audio representations with text
representations, without taking possible text-pronunciation mismatches into
account. In this paper, we present a gating strategy that assigns more
importance to the relevant audio features while suppressing irrelevant text
information. Moreover, given the transcriptions, we design an extra contrastive
loss to reduce the gap between the learning objectives of phoneme recognition
and MDD. We conducted experiments on two publicly available datasets (TIMIT and
L2-Arctic), and our best model improved the F1 score from $57.51\%$ to
$61.75\%$ over the baselines. In addition, we provide a detailed analysis that
sheds light on the effectiveness of the gating mechanism and contrastive
learning for MDD.
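The abstract names two mechanisms (the gate and the contrastive loss) without giving their formulations. The following is a minimal PyTorch sketch of one plausible reading: a sigmoid gate, computed from both modalities, that decides per frame how much of the text prior to mix into the audio features, together with an InfoNCE-style contrastive loss that pulls each audio embedding toward its matching phoneme embedding. The module name `GatedFusion`, the function `contrastive_alignment_loss`, the dimensions, and both formulations are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedFusion(nn.Module):
    """Fuse audio and text features through a learned sigmoid gate.

    When the prior text disagrees with what was actually pronounced, the
    gate can push toward the audio features and suppress the text prior.
    (Hypothetical formulation, for illustration only.)
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio, text: (batch, frames, dim); text is assumed to be already
        # aligned to the audio frames (e.g., via attention over the text).
        gate = torch.sigmoid(self.gate_proj(torch.cat([audio, text], dim=-1)))
        # gate -> 1: rely on the audio; gate -> 0: let the text prior through.
        return gate * audio + (1.0 - gate) * text


def contrastive_alignment_loss(audio_emb: torch.Tensor,
                               phone_emb: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss: the i-th audio embedding should be closest to the
    i-th phoneme embedding; other rows in the batch act as negatives."""
    a = F.normalize(audio_emb, dim=-1)        # (N, dim)
    p = F.normalize(phone_emb, dim=-1)        # (N, dim)
    logits = a @ p.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)


# Toy usage with random tensors, only to show the expected shapes.
fusion = GatedFusion(dim=256)
audio = torch.randn(4, 100, 256)
text = torch.randn(4, 100, 256)
fused = fusion(audio, text)                   # (4, 100, 256)
loss = contrastive_alignment_loss(torch.randn(32, 256), torch.randn(32, 256))
```

In this reading, the contrastive term would be added to the usual recognition loss (e.g., CTC or cross-entropy over phonemes), nudging the recognizer's embedding space toward the same notion of phoneme identity that MDD scoring relies on.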
Related papers
- Learning Speech Representation From Contrastive Token-Acoustic
Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech-phoneme pairs, achieving minimally supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Text Injection for Capitalization and Turn-Taking Prediction in Speech
Models [45.94388391693112]
This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model.
We show results demonstrating that our text injection method boosts capitalization performance for long-tail data.
arXiv Detail & Related papers (2023-08-14T18:28:04Z)
- Improving Deliberation by Text-Only and Semi-Supervised Training [42.942428288428836]
We propose incorporating text-only and semi-supervised training into an attention-based deliberation model.
We achieve 4%-12% WER reduction on various tasks compared to the baseline deliberation model.
We show that the deliberation model also achieves a positive human side-by-side evaluation.
arXiv Detail & Related papers (2022-06-29T15:30:44Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end trainable network in which feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text
Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
arXiv Detail & Related papers (2021-10-20T00:59:36Z)
- A study on the efficacy of model pre-training in developing neural
text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning
for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework; a hedged sketch of this style of fusion appears after this list.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- The effectiveness of unsupervised subword modeling with autoregressive
and cross-lingual phone-aware networks [36.24509775775634]
We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer.
Experiments on the ABX subword discriminability task conducted with the Libri-light and ZeroSpeech 2017 databases showed that our approach is competitive or superior to state-of-the-art studies.
arXiv Detail & Related papers (2020-12-17T12:33:49Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, they have been tackled with signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Phoneme Boundary Detection using Learnable Segmental Features [31.203969460341817]
Phoneme boundary detection is an essential first step for a variety of speech processing applications.
We propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection.
arXiv Detail & Related papers (2020-02-11T14:03:08Z)
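Several entries above fuse pre-trained acoustic and linguistic models; the Wav-BERT entry is the most explicit, unifying wav2vec 2.0 and BERT into one end-to-end trainable framework. Below is a minimal sketch of one way to wire such a pairing together with Hugging Face `transformers`. The mean-pooled concatenation head is an illustrative assumption, not Wav-BERT's actual fusion mechanism, and the checkpoint names are common public defaults rather than the paper's choices.

```python
import torch
import torch.nn as nn
from transformers import BertModel, Wav2Vec2Model


class AcousticLinguisticClassifier(nn.Module):
    """Joint acoustic-linguistic model in the spirit of Wav-BERT.

    Both pre-trained backbones stay trainable, so the whole module can be
    fine-tuned end to end. The fusion head is a deliberately naive stand-in.
    """

    def __init__(self, num_classes: int):
        super().__init__()
        self.acoustic = Wav2Vec2Model.from_pretrained(
            "facebook/wav2vec2-base-960h")
        self.linguistic = BertModel.from_pretrained("bert-base-uncased")
        fused_dim = (self.acoustic.config.hidden_size
                     + self.linguistic.config.hidden_size)
        self.head = nn.Linear(fused_dim, num_classes)

    def forward(self, input_values, input_ids, attention_mask):
        # input_values: raw waveform (batch, samples);
        # input_ids / attention_mask: the tokenized transcription.
        speech = self.acoustic(input_values).last_hidden_state.mean(dim=1)
        text = self.linguistic(
            input_ids=input_ids,
            attention_mask=attention_mask).last_hidden_state.mean(dim=1)
        # Naive fusion: concatenate utterance-level summaries of each modality.
        return self.head(torch.cat([speech, text], dim=-1))
```

Because the whole module is differentiable, gradients from the task loss reach both pre-trained backbones, which is the "end-to-end trainable" property the Wav-BERT summary highlights.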
This list is automatically generated from the titles and abstracts of the papers on this site.