Improving child speech recognition with augmented child-like speech
- URL: http://arxiv.org/abs/2406.10284v1
- Date: Wed, 12 Jun 2024 08:56:46 GMT
- Title: Improving child speech recognition with augmented child-like speech
- Authors: Yuanyuan Zhang, Zhengjun Yue, Tanvina Patel, Odette Scharenborg,
- Abstract summary: Cross-lingual child-to-child voice conversion significantly improved child ASR performance.
State-of-the-art ASRs show suboptimal performance for child speech.
- Score: 20.709414063132627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art ASRs show suboptimal performance for child speech. The scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied child-to-child voice conversion (VC) from existing child speakers in the dataset and additional (new) child speakers via monolingual and cross-lingual (Dutch-to-German) VC, respectively. The results showed that cross-lingual child-to-child VC significantly improved child ASR performance. Experiments on the impact of the quantity of child-to-child cross-lingual VC-generated data on fine-tuning (FT) ASR models gave the best results with two-fold augmentation for our FT-Conformer model and FT-Whisper model which reduced WERs with ~3% absolute compared to the baseline, and with six-fold augmentation for the model trained from scratch, which improved by an absolute 3.6% WER. Moreover, using a small amount of "high-quality" VC-generated data achieved similar results to those of our best-FT models.
Related papers
- Evaluation of state-of-the-art ASR Models in Child-Adult Interactions [27.30130353688078]
Speech foundation models show a noticeable performance drop (15-20% absolute WER) for child speech compared to adult speech in the conversational setting.
We employ LoRA on the best performing zero shot model (whisper-large) to probe the effectiveness of fine-tuning in a low resource setting.
arXiv Detail & Related papers (2024-09-24T14:42:37Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - A comparative analysis between Conformer-Transducer, Whisper, and
wav2vec2 for improving the child speech recognition [2.965450563218781]
We show that finetuning Conformer-transducer models on child speech yields significant improvements in ASR performance on child speech.
We also show Whisper and wav2vec2 adaptation on different child speech datasets.
arXiv Detail & Related papers (2023-11-07T19:32:48Z) - Disentangling Voice and Content with Self-Supervision for Speaker
Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z) - Adaptation of Whisper models to child speech recognition [3.2548794659022398]
We show that finetuning Whisper on child speech yields significant improvements in ASR performance on child speech.
utilizing self-supervised Wav2vec2 models that have been finetuned on child speech outperforms Whisper finetuning.
arXiv Detail & Related papers (2023-07-24T12:54:45Z) - Automatic Speech Recognition of Non-Native Child Speech for Language
Learning Applications [18.849741353784328]
We assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI.
We evaluate their performance on read and extemporaneous speech of native and non-native Dutch children.
arXiv Detail & Related papers (2023-06-29T06:14:26Z) - From English to More Languages: Parameter-Efficient Model Reprogramming
for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - Transfer Learning for Robust Low-Resource Children's Speech ASR with
Transformers and Source-Filter Warping [11.584388304271029]
We propose a data augmentation technique based on the source-filter model of speech to close the domain gap between adult and children's speech.
Using this augmentation strategy, we apply transfer learning on a Transformer model pre-trained on adult data.
This model follows the recently introduced XLS-R architecture, a wav2vec 2.0 model pre-trained on several cross-lingual adult speech corpora.
arXiv Detail & Related papers (2022-06-19T12:57:47Z) - ASR data augmentation in low-resource settings using cross-lingual
multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z) - LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for languages with low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z) - Unsupervised Cross-lingual Representation Learning for Speech
Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.