A Full Text-Dependent End to End Mispronunciation Detection and
Diagnosis with Easy Data Augmentation Techniques
- URL: http://arxiv.org/abs/2104.08428v1
- Date: Sat, 17 Apr 2021 03:11:41 GMT
- Title: A Full Text-Dependent End to End Mispronunciation Detection and
Diagnosis with Easy Data Augmentation Techniques
- Authors: Kaiqi Fu and Jones Lin and Dengfeng Ke and Yanlu Xie and Jinsong Zhang
and Binghuai Lin
- Abstract summary: We present a novel text-dependent model that differs from SED-MDD.
We propose three simple data augmentation methods that effectively improve the model's ability to capture mispronounced phonemes.
- Score: 28.59181595057581
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, end-to-end mispronunciation detection and diagnosis (MD&D) systems
have become a popular alternative that greatly simplifies the model-building process
of conventional hybrid DNN-HMM systems by representing complicated modules with
a single deep network architecture. In this paper, in order to utilize the
prior text in the end-to-end structure, we present a novel text-dependent model
that differs from SED-MDD: the model achieves a fully end-to-end system
by aligning the audio with the phoneme sequence of the prior text inside the
model through the attention mechanism. Moreover, taking the prior text as input
introduces an imbalance between positive and negative samples in the phoneme
sequence. To alleviate this problem, we propose three simple data augmentation
methods, which effectively improve the model's ability to capture
mispronounced phonemes. We conduct experiments on L2-ARCTIC, and our best
F-measure improves from 49.29% to 56.08% compared to the
CNN-RNN-CTC model.
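The core idea in the abstract, aligning audio frames with the phoneme sequence of the prior text via attention inside the model, can be sketched as scaled dot-product attention. This is a minimal illustration only; the dimensions, function names, and single-head formulation are assumptions, not the authors' implementation.

```python
# Sketch: align prior-text phonemes (queries) with encoded audio frames
# (keys/values) using scaled dot-product attention. Purely illustrative.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align_text_to_audio(phoneme_emb, audio_feats):
    """For each phoneme, attend over all audio frames.

    phoneme_emb: (L, d) embeddings of the prior-text phoneme sequence
    audio_feats: (T, d) encoded acoustic frames
    Returns (L, d) per-phoneme acoustic summaries and the (L, T) soft alignment.
    """
    d = phoneme_emb.shape[-1]
    scores = phoneme_emb @ audio_feats.T / np.sqrt(d)  # (L, T) similarity
    weights = softmax(scores, axis=-1)                 # rows sum to 1
    context = weights @ audio_feats                    # acoustic summary per phoneme
    return context, weights

rng = np.random.default_rng(0)
ctx, att = align_text_to_audio(rng.normal(size=(5, 16)),
                               rng.normal(size=(40, 16)))
```

Each phoneme's context vector can then be scored against the phoneme identity to decide whether it was pronounced correctly.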
Related papers
- DA-Flow: Dual Attention Normalizing Flow for Skeleton-based Video Anomaly Detection [52.74152717667157]
We propose a lightweight module called the Dual Attention Module (DAM) for capturing cross-dimension interaction relationships in spatio-temporal skeletal data.
It employs a frame attention mechanism to identify the most significant frames and a skeleton attention mechanism to capture broader relationships across fixed partitions with minimal parameters and FLOPs.
arXiv Detail & Related papers (2024-06-05T06:18:03Z) - Hybrid Attention-based Encoder-decoder Model for Efficient Language Model Adaptation [13.16188747098854]
We propose a novel hybrid attention-based encoder-decoder (HAED) speech recognition model.
Our model separates the acoustic and language models, allowing for the use of conventional text-based language model adaptation techniques.
We demonstrate that the proposed HAED model yields 23% relative Word Error Rate (WER) improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2023-09-14T01:07:36Z) - Minimally-Supervised Speech Synthesis with Conditional Diffusion Model
and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z) - Leveraging Symmetrical Convolutional Transformer Networks for Speech to
Singing Voice Style Transfer [49.01417720472321]
We develop a novel neural network architecture, called SymNet, which models the alignment of the input speech with the target melody.
Experiments are performed on the NUS and NHSS datasets which consist of parallel data of speech and singing voice.
arXiv Detail & Related papers (2022-08-26T02:54:57Z) - Continuous Offline Handwriting Recognition using Deep Learning Models [0.0]
Handwritten text recognition is an open problem of great interest in the area of automatic document image analysis.
We have proposed a new recognition model based on integrating two types of deep learning architectures: convolutional neural networks (CNN) and sequence-to-sequence (seq2seq) models.
The new proposed model provides competitive results with those obtained with other well-established methodologies.
arXiv Detail & Related papers (2021-12-26T07:31:03Z) - Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
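The factorization described above, separating the blank prediction from the vocabulary prediction so the latter behaves like an adaptable standalone language model, can be sketched as follows. The function names, logit shapes, and the sigmoid/softmax combination are assumptions for illustration, not the paper's exact joint network.

```python
# Sketch of the factorized-output idea: a dedicated blank probability, plus a
# vocabulary distribution from an LM-like predictor scaled by the non-blank
# mass. Illustrative only; not the authors' architecture.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def factorized_output(blank_logit, vocab_logits):
    """Combine a scalar blank score with vocabulary logits into one
    distribution over [blank, v_1, ..., v_V]."""
    p_blank = sigmoid(blank_logit)
    p_labels = (1.0 - p_blank) * softmax(vocab_logits)
    return np.concatenate(([p_blank], p_labels))

dist = factorized_output(1.5, np.array([0.2, -0.3, 0.9]))
```

Because the vocabulary branch is a self-contained label predictor, it can in principle be fine-tuned on out-of-domain text without retraining the acoustic components.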
arXiv Detail & Related papers (2021-09-27T15:04:00Z) - Improving Tail Performance of a Deliberation E2E ASR Model Using a Large
Text Corpus [35.45918249451485]
End-to-end (E2E) automatic speech recognition systems lack the distinct language model (LM) component that characterizes traditional speech systems.
Shallow fusion has been proposed as a method for incorporating a pre-trained LM into an E2E model at inference time.
We apply shallow fusion to incorporate a very large text corpus into a state-of-the-art E2E ASR model.
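Shallow fusion, as summarized above, amounts to a log-linear interpolation of the E2E model's token scores with an external LM's scores at each decoding step. A minimal sketch, with the interpolation weight `lam` and the toy three-token distributions being illustrative assumptions:

```python
# Sketch of shallow fusion: fuse E2E and LM log-probabilities when ranking
# candidate tokens during beam search. Values are toy examples.
import numpy as np

def shallow_fusion_step(e2e_logprobs, lm_logprobs, lam=0.3):
    """Return fused scores used to rank candidate tokens at one decode step."""
    return e2e_logprobs + lam * lm_logprobs

e2e = np.log(np.array([0.6, 0.3, 0.1]))   # E2E posterior over 3 tokens
lm  = np.log(np.array([0.2, 0.7, 0.1]))   # external LM prior
best = int(np.argmax(shallow_fusion_step(e2e, lm)))
```

Because the LM only contributes additive log scores at inference time, it can be swapped or retrained on new text without touching the E2E model.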
arXiv Detail & Related papers (2020-08-24T14:53:10Z) - An Effective End-to-End Modeling Approach for Mispronunciation Detection [12.113290059233977]
We present a novel use of the CTC-Attention approach for the mispronunciation detection task.
We also perform input augmentation with text prompt information to make the resulting E2E model more tailored for the MD task.
A series of Mandarin MD experiments demonstrate that our approach brings about systematic and substantial performance improvements.
arXiv Detail & Related papers (2020-05-18T03:37:21Z) - Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner
Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on RNN-Transducer together with improved beam search, is only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)