Improving Child Speech Recognition and Reading Mistake Detection by Using Prompts
- URL: http://arxiv.org/abs/2506.11079v1
- Date: Wed, 04 Jun 2025 05:55:12 GMT
- Title: Improving Child Speech Recognition and Reading Mistake Detection by Using Prompts
- Authors: Lingyun Gao, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik
- Abstract summary: The best performing system achieved state-of-the-art recognition performance on Dutch child read speech. It significantly improved reading mistake detection, increasing the F1 score from 0.39 to 0.73.
- Score: 10.137389745562512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic reading aloud evaluation can provide valuable support to teachers by enabling more efficient scoring of reading exercises. However, research on reading evaluation systems and applications remains limited. We present a novel multimodal approach that leverages audio and knowledge from text resources. In particular, we explored the potential of using Whisper and instruction-tuned large language models (LLMs) with prompts to improve transcriptions for child speech recognition, as well as their effectiveness in downstream reading mistake detection. Our results demonstrate the effectiveness of prompting Whisper and prompting the LLMs, compared to the baseline Whisper model without prompting. The best performing system achieved state-of-the-art recognition performance on Dutch child read speech, with a word error rate (WER) of 5.1%, improving on the baseline WER of 9.4%. Furthermore, it significantly improved reading mistake detection, increasing the F1 score from 0.39 to 0.73.
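The WER figures above are standard word-level edit distance. As a point of reference, here is a minimal self-contained sketch of how WER is typically computed; the example sentences are illustrative, not drawn from the paper's data:

```python
from typing import List

def wer(reference: List[str], hypothesis: List[str]) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    r, h = reference, hypothesis
    # dp[i][j] = minimum edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(h) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(r)][len(h)] / len(r)

ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
print(round(wer(ref, hyp), 3))  # 1 substitution + 1 deletion over 6 words -> 0.333
```

In practice, libraries such as jiwer provide the same computation along with the match error rate (MER) used by some of the related papers below.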
Related papers
- Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling [0.0]
This study assesses five cutting-edge ASR systems' recognition of non-native English accented speech using recordings from the L2-ARCTIC corpus. For read speech, Whisper and AssemblyAI achieved the best accuracy, with mean Match Error Rates (MER) of 0.054 and 0.056 respectively. For spontaneous speech, RevAI performed best with a mean MER of 0.063.
arXiv Detail & Related papers (2025-03-10T05:09:44Z) - Reading Miscue Detection in Primary School through Automatic Speech Recognition [10.137389745562512]
This study investigates how efficiently state-of-the-art (SOTA) pretrained ASR models recognize native Dutch children's speech.
We found that Hubert Large fine-tuned on Dutch speech achieves SOTA phoneme-level child speech recognition.
Wav2Vec2 Large shows the highest recall at 0.83, whereas Whisper exhibits the highest precision at 0.52 and an F1 score of 0.52.
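The F1 score reported here is the harmonic mean of precision and recall, so when both are 0.52 the F1 is also 0.52. A minimal sketch of the relationship (not the paper's evaluation code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# When precision and recall coincide, F1 equals that shared value.
print(round(f1_score(0.52, 0.52), 2))  # -> 0.52
```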
arXiv Detail & Related papers (2024-06-11T08:41:21Z) - Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults [4.765434968114876]
We enhance the utility of the MyST dataset through more efficient data preprocessing.
We show that this improvement can be generalized to unseen datasets.
Results showcase the viable and efficient integration of Whisper for effective children's speech recognition.
arXiv Detail & Related papers (2023-09-12T06:58:18Z) - Leveraging Visemes for Better Visual Speech Representation and Lip Reading [2.7836084563851284]
We propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading.
The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
arXiv Detail & Related papers (2023-07-19T17:38:26Z) - Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z) - Automatic Assessment of Oral Reading Accuracy for Reading Diagnostics [9.168525887419388]
We evaluate six state-of-the-art ASR-based systems for automatically assessing Dutch oral reading accuracy using Kaldi and Whisper.
Results show our most successful system reached substantial agreement with human evaluations.
arXiv Detail & Related papers (2023-06-06T06:49:58Z) - Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z) - Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z) - NUVA: A Naming Utterance Verifier for Aphasia Treatment [49.114436579008476]
Assessment of speech performance using picture naming tasks is a key method for both diagnosis and monitoring of responses to treatment interventions by people with aphasia (PWA).
Here we present NUVA, an utterance verification system incorporating a deep learning element that classifies 'correct' versus 'incorrect' naming attempts from aphasic stroke patients.
When tested on eight native British-English speaking PWA, the system's accuracy ranged from 83.6% to 93.6%, with a 10-fold cross-validation mean of 89.5%.
arXiv Detail & Related papers (2021-02-10T13:00:29Z) - UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z) - Improved Noisy Student Training for Automatic Speech Recognition [89.8397907990268]
"Noisy student training" is an iterative self-training method that leverages augmentation to improve network performance.
We find effective methods to filter, balance and augment the data generated in between self-training iterations.
We are able to improve upon the previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h (4.74%/12.20%) and LibriSpeech (1.9%/4.1%).
arXiv Detail & Related papers (2020-05-19T17:57:29Z)