A comparative analysis between Conformer-Transducer, Whisper, and
wav2vec2 for improving the child speech recognition
- URL: http://arxiv.org/abs/2311.04936v1
- Date: Tue, 7 Nov 2023 19:32:48 GMT
- Title: A comparative analysis between Conformer-Transducer, Whisper, and
wav2vec2 for improving the child speech recognition
- Authors: Andrei Barcovschi and Rishabh Jain and Peter Corcoran
- Abstract summary: We show that finetuning Conformer-transducer models on child speech yields significant improvements in ASR performance on child speech.
We also show Whisper and wav2vec2 adaptation on different child speech datasets.
- Score: 2.965450563218781
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Speech Recognition (ASR) systems have progressed significantly in
their performance on adult speech data; however, transcribing child speech
remains challenging due to the acoustic differences in the characteristics of
child and adult voices. This work aims to explore the potential of adapting
state-of-the-art Conformer-transducer models to child speech to improve child
speech recognition performance. Furthermore, the results are compared with
those of self-supervised wav2vec2 models and semi-supervised multi-domain
Whisper models that were previously finetuned on the same data. We demonstrate
that finetuning Conformer-transducer models on child speech yields significant
improvements in ASR performance on child speech, compared to the non-finetuned
models. We also show Whisper and wav2vec2 adaptation on different child speech
datasets. Our detailed comparative analysis shows that wav2vec2 provides the
most consistent performance improvements among the three methods studied.
Related papers
- Evaluation of state-of-the-art ASR Models in Child-Adult Interactions [27.30130353688078]
Speech foundation models show a noticeable performance drop (15-20% absolute WER) for child speech compared to adult speech in the conversational setting.
We employ LoRA on the best performing zero shot model (whisper-large) to probe the effectiveness of fine-tuning in a low resource setting.
arXiv Detail & Related papers (2024-09-24T14:42:37Z) - Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-regularized spectral basis embedding-SBE features that exploit a special regularization term to enforce homogeneity of speaker features in adaptation.
Feature-based learning hidden unit contributions (f-LHUC) that are conditioned on VR-LH features that are shown to be insensitive to speaker-level data quantity in testtime adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z) - Improving child speech recognition with augmented child-like speech [20.709414063132627]
Cross-lingual child-to-child voice conversion significantly improved child ASR performance.
State-of-the-art ASRs show suboptimal performance for child speech.
arXiv Detail & Related papers (2024-06-12T08:56:46Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - Adaptation of Whisper models to child speech recognition [3.2548794659022398]
We show that finetuning Whisper on child speech yields significant improvements in ASR performance on child speech.
utilizing self-supervised Wav2vec2 models that have been finetuned on child speech outperforms Whisper finetuning.
arXiv Detail & Related papers (2023-07-24T12:54:45Z) - Analysing the Impact of Audio Quality on the Use of Naturalistic
Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z) - AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot
AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z) - Transfer Learning for Robust Low-Resource Children's Speech ASR with
Transformers and Source-Filter Warping [11.584388304271029]
We propose a data augmentation technique based on the source-filter model of speech to close the domain gap between adult and children's speech.
Using this augmentation strategy, we apply transfer learning on a Transformer model pre-trained on adult data.
This model follows the recently introduced XLS-R architecture, a wav2vec 2.0 model pre-trained on several cross-lingual adult speech corpora.
arXiv Detail & Related papers (2022-06-19T12:57:47Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedup up to 21.4x than autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Learning Explicit Prosody Models and Deep Speaker Embeddings for
Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z) - Learning to Understand Child-directed and Adult-directed Speech [18.29692441616062]
Human language acquisition research indicates that child-directed speech helps language learners.
We compare the task performance of models trained on adult-directed speech (ADS) and child-directed speech (CDS)
We find indications that CDS helps in the initial stages of learning, but eventually, models trained on ADS reach comparable task performance, and generalize better.
arXiv Detail & Related papers (2020-05-06T10:47:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.