Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet
- URL: http://arxiv.org/abs/2508.16576v1
- Date: Fri, 22 Aug 2025 17:59:35 GMT
- Title: Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet
- Authors: Anyu Ying, Natarajan Balaji Shankar, Chyi-Jiunn Lin, Mohan Shi, Pu Wang, Hye-jin Shim, Siddhant Arora, Hugo Van hamme, Abeer Alwan, Shinji Watanabe
- Abstract summary: We compare flat-start training across datasets, SSL representations (WavLM, XEUS), and decoder architectures. SSL representations are biased toward adult speech, with flat-start training on child speech mitigating these biases. Age-related ASR and speaker verification analysis highlights the limitations of proprietary models.
- Score: 72.53502346791814
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite advancements in ASR, child speech recognition remains challenging due to acoustic variability and limited annotated data. While fine-tuning adult ASR models on child speech is common, comparisons with flat-start training remain underexplored. We compare flat-start training across multiple datasets, SSL representations (WavLM, XEUS), and decoder architectures. Our results show that SSL representations are biased toward adult speech, with flat-start training on child speech mitigating these biases. We also analyze model scaling, finding consistent improvements up to 1B parameters, beyond which performance plateaus. Additionally, age-related ASR and speaker verification analysis highlights the limitations of proprietary models like Whisper, emphasizing the need for open-data models for reliable child speech research. All investigations are conducted using ESPnet, and our publicly available benchmark provides insights into training strategies for robust child speech processing.
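All of the comparisons in the benchmark are scored with word error rate (WER), the standard ASR metric. As a reference point, a minimal pure-Python WER can be sketched as follows (this is an illustrative implementation, not the paper's code; ESPnet and common toolkits compute it the same way via word-level edit distance):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

A WER of 0.66, as reported for Whisper on child speech below, means roughly two word-level errors for every three reference words.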
Related papers
- Self-Supervised Learning for Speaker Recognition: A study and review [0.0]
Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL to Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work aims to highlight recent trends and advancements, identifying current challenges in the field.
arXiv Detail & Related papers (2026-02-11T13:16:07Z)
- Arabic Little STT: Arabic Children Speech Recognition Dataset [0.0]
We present Arabic Little STT, a dataset of Levantine Arabic child speech recorded in classrooms. We also conduct a systematic assessment of Whisper, a state-of-the-art automatic speech recognition (ASR) model, on this dataset. Our evaluation reveals that even the best-performing model (Large_v3) struggles significantly, achieving a 0.66 word error rate (WER) on child speech.
arXiv Detail & Related papers (2025-10-27T13:30:54Z)
- Can Layer-wise SSL Features Improve Zero-Shot ASR Performance for Children's Speech? [43.31597557333867]
This study investigates the effectiveness of layer-wise features extracted from state-of-the-art SSL pre-trained models for improving ASR on children's speech in zero-shot scenarios. The analysis identifies which layers are most effective in this setting.
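Layer-wise analyses of this kind typically combine per-layer encoder features with a learned softmax-weighted sum (the recipe popularized by SUPERB-style probing). A minimal sketch with plain Python lists; real systems operate on torch tensors from models such as WavLM, and the layer scores here are illustrative stand-ins for learned parameters:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layers(layer_features, layer_scores):
    """Weighted sum of per-layer feature vectors.

    layer_features: one feature vector (list of floats) per encoder layer.
    layer_scores: one learnable scalar per layer; softmax turns them into weights.
    """
    weights = softmax(layer_scores)
    dim = len(layer_features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, layer_features))
            for d in range(dim)]
```

Inspecting the learned weights (or probing each layer separately) is what reveals which layers carry the most useful information for child speech.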
arXiv Detail & Related papers (2025-08-28T21:32:36Z)
- Towards few-shot isolated word reading assessment [17.85337022148277]
We explore an ASR-free method for isolated word reading assessment in low-resource settings. Our few-shot approach compares input child speech to a small set of adult-provided reference templates. Despite the success of SSL representations in low-resource speech tasks, our work highlights their limitations for processing child data.
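Template comparison of this kind is commonly done with dynamic time warping (DTW), which aligns sequences of different lengths before measuring distance. A hedged sketch using scalar feature sequences for brevity (real systems compare multidimensional SSL frame sequences, and the templates below are toy data):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best alignment ending at (i, j): extend a, extend b, or both.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def classify(child_utterance, templates):
    """Return the word whose adult reference template is closest under DTW."""
    return min(templates, key=lambda word: dtw_distance(child_utterance, templates[word]))
```

With only a handful of reference templates per word, no ASR decoding or task-specific training is needed, which is the appeal in low-resource settings.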
arXiv Detail & Related papers (2025-07-16T13:20:32Z)
- Examining Test-Time Adaptation for Personalized Child Speech Recognition [26.233159818496006]
Test-time adaptation (TTA) methods have shown great potential in bridging this domain gap. We investigate the effectiveness of two widely used TTA methods, SUTA and SGEM, in adapting off-the-shelf ASR models and their fine-tuned versions for child speech recognition. Our findings show that TTA significantly improves the performance of both off-the-shelf and fine-tuned ASR models, both on average and across individual child speakers.
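SUTA-style TTA works by taking gradient steps on the model at inference time to reduce the entropy of its frame-level output distributions, sharpening uncertain predictions without any labels. The quantity being minimized can be sketched as follows (a toy illustration of the objective only; the actual adaptation updates model parameters with autograd):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one frame's output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaky (confident) frame has lower entropy than a flat (uncertain) one;
# entropy-minimization TTA reduces the average of this quantity over the
# frames of each test utterance.
confident_frame = [0.9, 0.05, 0.05]
uncertain_frame = [0.34, 0.33, 0.33]
```

The child-speech domain gap tends to produce exactly the flat, uncertain distributions this objective targets.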
arXiv Detail & Related papers (2024-09-19T21:40:07Z)
- A comparative analysis between Conformer-Transducer, Whisper, and wav2vec2 for improving the child speech recognition [2.965450563218781]
We show that fine-tuning Conformer-Transducer models on child speech yields significant ASR improvements. We also evaluate Whisper and wav2vec2 adaptation on different child speech datasets.
arXiv Detail & Related papers (2023-11-07T19:32:48Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech [15.975756437343742]
Recent self-supervised learning (SSL) models have proven to learn rich representations of speech.
We conduct a comprehensive analysis to link speech representations to articulatory trajectories measured by electromagnetic articulography (EMA).
Our findings suggest that SSL models learn to align closely with continuous articulations, and provide a novel insight into speech SSL.
arXiv Detail & Related papers (2022-10-21T04:24:29Z)
- Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping [11.584388304271029]
We propose a data augmentation technique based on the source-filter model of speech to close the domain gap between adult and children's speech.
Using this augmentation strategy, we apply transfer learning on a Transformer model pre-trained on adult data.
This model follows the recently introduced XLS-R architecture, a wav2vec 2.0 model pre-trained on several cross-lingual adult speech corpora.
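The paper's augmentation warps source and filter components separately; a simpler relative of frequency-warping augmentation is vocal tract length perturbation (VTLP), which applies a piecewise-linear warp to the frequency axis. A sketch of such a warping function, with illustrative breakpoint and Nyquist values for 16 kHz audio (not the paper's exact method or parameters):

```python
def warp_freq(f, alpha, f_hi=4800.0, f_max=8000.0):
    """Piecewise-linear VTLP-style frequency warp.

    Frequencies below a breakpoint are scaled by the warp factor alpha
    (alpha > 1 shifts formants up, mimicking a shorter vocal tract);
    above it, the mapping is linear so that f_max still maps to f_max,
    keeping the full band intact.
    """
    breakpoint_ = f_hi * min(alpha, 1.0) / alpha
    if f <= breakpoint_:
        return alpha * f
    # Linear segment from (breakpoint_, alpha * breakpoint_) to (f_max, f_max).
    slope = (f_max - alpha * breakpoint_) / (f_max - breakpoint_)
    return alpha * breakpoint_ + slope * (f - breakpoint_)
```

Warping adult training speech toward child-like spectral characteristics is what narrows the acoustic domain gap before transfer learning.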
arXiv Detail & Related papers (2022-06-19T12:57:47Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on general applications of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.