SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation
- URL: http://arxiv.org/abs/2211.00923v3
- Date: Wed, 12 Jul 2023 12:28:56 GMT
- Title: SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation
- Authors: Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali, Hamdy Mubarak, and Shazia Afzal
- Abstract summary: SpeechBlender is a fine-grained data augmentation pipeline for generating mispronunciation errors.
Our proposed technique achieves state-of-the-art results on the Speechocean762 dataset for ASR-dependent mispronunciation detection models.
- Score: 11.91301106502376
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The lack of labeled second-language (L2) speech data is a major challenge in designing mispronunciation detection models. We introduce SpeechBlender, a fine-grained data augmentation pipeline for generating mispronunciation errors to overcome this data scarcity. SpeechBlender applies a variety of masks to target different regions of phonetic units and uses mixing factors to linearly interpolate raw speech signals while augmenting pronunciation. The masks facilitate smooth blending of the signals, generating more effective samples than the 'Cut/Paste' method. Our proposed technique achieves state-of-the-art results on the Speechocean762 dataset for ASR-dependent mispronunciation detection models at the phoneme level, with a 2.0% gain in Pearson Correlation Coefficient (PCC) over the previous state of the art [1]. Additionally, we demonstrate a 5.0% improvement at the phoneme level over our baseline, and a 4.6% increase in F1-score on the Arabic AraVoiceL2 test set.
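The core blending operation is simple enough to sketch. Below is a minimal, hypothetical illustration of the mask-and-mix idea from the abstract: two raw signals, assumed to be time-aligned at the phoneme level, are linearly interpolated over a target phoneme region using a trapezoidal mask, so the mix fades in and out rather than splicing abruptly. The function name, mask shape, phoneme boundaries, and mixing factor are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of mask-based blending (assumptions: time-aligned
# signals, trapezoidal mask, hypothetical boundaries and mixing factor).
import numpy as np

def blend_phoneme(good: np.ndarray, bad: np.ndarray,
                  start: int, end: int, lam: float = 0.5,
                  ramp: int = 160) -> np.ndarray:
    """Linearly interpolate `bad` into `good` over samples [start, end),
    with a trapezoidal mask so the mix fades in and out instead of the
    hard splice a 'Cut/Paste' approach would produce."""
    mask = np.zeros(len(good))
    mask[start:end] = lam                                    # flat mixing region
    mask[start:start + ramp] = np.linspace(0.0, lam, ramp)   # fade in
    mask[end - ramp:end] = np.linspace(lam, 0.0, ramp)       # fade out
    return (1.0 - mask) * good + mask * bad[:len(good)]

# Usage: blend a (hypothetical) mispronounced phoneme spanning samples
# 8000-12000 of a 16 kHz utterance into an otherwise good recording.
sr = 16000
good_utt = np.random.randn(2 * sr)  # stand-ins for real waveforms
bad_utt = np.random.randn(2 * sr)
augmented = blend_phoneme(good_utt, bad_utt, start=8000, end=12000, lam=0.6)
```

Varying the mask's position, width, and mixing factor would produce errors of different severities, in the spirit of the fine-grained control the abstract describes.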
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios. (A toy splicing sketch follows this entry.)
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
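As a rough illustration of the splicing idea behind Speech Collage, the sketch below concatenates word-level audio segments drawn from monolingual sources, with a short linear crossfade at each splice point. The segment lengths and crossfade width are invented for this example; the paper's actual splicing procedure may differ.

```python
# Toy splicing sketch (assumptions: pre-cut word-level segments,
# made-up segment lengths, 80-sample linear crossfade).
import numpy as np

def splice(segments, xfade: int = 80) -> np.ndarray:
    """Concatenate audio segments with a short linear crossfade at each
    splice point to soften the transition."""
    out = segments[0].copy()
    fade = np.linspace(0.0, 1.0, xfade)
    for seg in segments[1:]:
        out[-xfade:] = out[-xfade:] * (1.0 - fade) + seg[:xfade] * fade
        out = np.concatenate([out, seg[xfade:]])
    return out

# e.g. a segment from one language, an embedded segment from another,
# then the first language again (random stand-ins for real segments)
utt = splice([np.random.randn(8000), np.random.randn(6000), np.random.randn(8000)])
```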
- Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment [28.76055994423364]
Current mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition.
One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2 speech.
We leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-tuning approach based on pre-trained self-supervised learning (SSL) models. (A schematic pseudo-labeling sketch follows this entry.)
arXiv Detail & Related papers (2022-03-29T22:40:31Z)
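A schematic sketch of momentum pseudo-labeling as summarized above: a teacher network labels unlabeled L2 speech, the student trains on those pseudo-labels, and the teacher tracks the student via an exponential moving average. The toy linear model, decay value, and update cadence are assumptions for illustration only.

```python
# Schematic momentum pseudo-labeling step (toy model; real systems
# fine-tune a pre-trained SSL encoder such as wav2vec 2.0).
import copy
import torch
import torch.nn as nn

student = nn.Linear(40, 42)        # stand-in for an SSL-based phoneme classifier
teacher = copy.deepcopy(student)   # teacher starts as a copy of the student

@torch.no_grad()
def momentum_update(teacher, student, m: float = 0.999):
    # teacher <- m * teacher + (1 - m) * student (the "momentum" step)
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(m).add_(s, alpha=1.0 - m)

# One PL step on a batch of unlabeled L2 features (random stand-ins).
feats = torch.randn(8, 40)
pseudo = teacher(feats).argmax(dim=-1)                      # teacher's pseudo-labels
loss = nn.functional.cross_entropy(student(feats), pseudo)  # train student on them
loss.backward()                                             # optimizer.step() would follow
momentum_update(teacher, student)
```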
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets. (A toy sketch of the target swap follows this entry.)
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
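The target-switching trick can be sketched with toy tensors. In the simplified loss below, each view (original and noisy) must also predict the other view's quantized targets via a frame-level contrastive objective, which pushes the two representations together. Real wav2vec 2.0 uses learned quantizer codebooks and sampled negatives; this cosine-similarity toy version only illustrates the swap.

```python
# Toy swapped-target contrastive loss (random tensors stand in for
# contextual representations and quantized targets).
import torch
import torch.nn.functional as F

def swapped_contrastive_loss(ctx_orig, ctx_noisy, q_orig, q_noisy, tau: float = 0.1):
    """ctx_*: contextual representations (T, D); q_*: quantized targets (T, D).
    The standard terms predict same-view targets; the switched terms swap them."""
    def nce(ctx, targets):
        # cosine-similarity logits of every frame against all target frames;
        # the time-matched frame (the diagonal) is the positive
        logits = F.cosine_similarity(ctx.unsqueeze(1), targets.unsqueeze(0), dim=-1) / tau
        return F.cross_entropy(logits, torch.arange(len(ctx)))
    standard = nce(ctx_orig, q_orig) + nce(ctx_noisy, q_noisy)
    switched = nce(ctx_orig, q_noisy) + nce(ctx_noisy, q_orig)
    return standard + switched

T, D = 50, 64
loss = swapped_contrastive_loss(torch.randn(T, D), torch.randn(T, D),
                                torch.randn(T, D), torch.randn(T, D))
```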
- MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR).
We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation. (A minimal mixup sketch follows this entry.)
arXiv Detail & Related papers (2021-02-25T03:40:43Z)
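MixSpeech's mixup step is small enough to sketch directly: two input feature sequences are combined with a weight lam drawn from a Beta distribution, and the recognition loss is the lam-weighted sum of the losses against the two transcripts. The shapes, Beta parameters, and the loss call in the final comment are illustrative assumptions.

```python
# Minimal mixup-for-ASR sketch (equal-length sequences assumed here;
# real batches would need padding/truncation).
import torch

def mixspeech(feats_a: torch.Tensor, feats_b: torch.Tensor, lam: float) -> torch.Tensor:
    """Mix two (T, D) feature sequences with weight lam."""
    return lam * feats_a + (1.0 - lam) * feats_b

lam = torch.distributions.Beta(0.5, 0.5).sample().item()
mixed = mixspeech(torch.randn(200, 80), torch.randn(200, 80), lam)
# The training loss would mirror the mix (model/asr_loss/text_* are hypothetical):
# loss = lam * asr_loss(model(mixed), text_a) + (1 - lam) * asr_loss(model(mixed), text_b)
```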
- Text Augmentation for Language Models in High Error Recognition Scenario [0.0]
We compare augmentation based on global error statistics with one based on per-word unigram statistics of ASR errors.
Our best augmentation scheme increases the absolute WER improvement from second-pass rescoring from 1.1% to 1.9% on the CHiME-6 challenge. (A small error-sampling sketch follows this entry.)
arXiv Detail & Related papers (2020-11-11T20:21:21Z)
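A small sketch of what per-word unigram error statistics could look like in practice: each word is substituted (or deleted) according to confusion counts estimated from ASR hypotheses. The tiny confusion table below is invented for illustration; a real one would be built from ASR output aligned to reference transcripts.

```python
# Per-word error-statistics text augmentation sketch (invented counts).
import random

# word -> list of (asr_hypothesis, count); "" marks a deletion
confusions = {
    "speech": [("speech", 90), ("beach", 6), ("", 4)],
}

def corrupt(words, table):
    """Sample an ASR-like corruption of a word sequence from per-word
    unigram error statistics (unseen words pass through unchanged)."""
    out = []
    for w in words:
        options, weights = zip(*table.get(w, [(w, 1)]))
        choice = random.choices(options, weights=weights)[0]
        if choice:  # skip deletions
            out.append(choice)
    return out

print(corrupt("the speech model".split(), confusions))
```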
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)