Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in
Ghana
- URL: http://arxiv.org/abs/2310.17606v1
- Date: Thu, 26 Oct 2023 17:30:13 GMT
- Title: Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in
Ghana
- Authors: Owen Henkel, Hannah Horne-Robinson, Libby Hills, Bill Roberts, Joshua
McGrane
- Abstract summary: This paper reports on three recent experiments utilizing large-scale speech models to evaluate the oral reading fluency (ORF) of students in Ghana.
We find that Whisper V2 produces transcriptions of Ghanaian students reading aloud with a Word Error Rate of 13.5.
This is close to the model's average WER on adult speech (12.8) and would have been considered state-of-the-art for children's speech transcription only a few years ago.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper reports on a set of three recent experiments utilizing large-scale
speech models to evaluate the oral reading fluency (ORF) of students in Ghana.
While ORF is a well-established measure of foundational literacy, assessing it
typically requires one-on-one sessions between a student and a trained
evaluator, a process that is time-consuming and costly. Automating the
evaluation of ORF could support better literacy instruction, particularly in
education contexts where formative assessment is uncommon due to large class
sizes and limited resources. To our knowledge, this research is among the first
to examine the use of the most recent versions of large-scale speech models
(Whisper V2, wav2vec 2.0) for ORF assessment in the Global South.
We find that Whisper V2 produces transcriptions of Ghanaian students reading
aloud with a Word Error Rate of 13.5. This is close to the model's average WER
on adult speech (12.8) and would have been considered state-of-the-art for
children's speech transcription only a few years ago. We also find that when
these transcriptions are used to produce fully automated ORF scores, they
closely align with scores generated by expert human graders, with a correlation
coefficient of 0.96. Importantly, these results were achieved on a
representative dataset (i.e., students with regional accents, recordings taken
in actual classrooms), using a free and publicly available speech model out of
the box (i.e., no fine-tuning). This suggests that using large-scale speech
models to assess ORF may be feasible to implement and scale in lower-resource,
linguistically diverse educational contexts.
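The pipeline the abstract describes (off-the-shelf Whisper transcription, WER against the passage text, fully automated ORF scoring, correlation with expert human graders) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the words-correct-per-minute scoring rule, the libraries (openai-whisper, jiwer, scipy), the audio file name, the passage, and the score lists are assumptions or placeholders rather than details taken from the paper.

# Hedged sketch of an automated ORF pipeline: transcribe a read-aloud recording with
# off-the-shelf Whisper large-v2, compute WER against the passage text, and convert
# the transcription into an approximate words-correct-per-minute (WCPM) score.
# Assumptions not stated in the abstract: WCPM as the ORF score, errors approximated
# from WER, and placeholder file paths / score lists.

import jiwer                      # pip install jiwer
import whisper                    # pip install openai-whisper
from scipy.stats import pearsonr  # pip install scipy


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER of the ASR transcription against the passage text."""
    return jiwer.wer(reference.lower(), hypothesis.lower())


def wcpm(reference: str, hypothesis: str, duration_seconds: float) -> float:
    """Approximate WCPM: correct words = reference words minus WER-implied errors (assumption)."""
    n_ref = len(reference.split())
    errors = word_error_rate(reference, hypothesis) * n_ref
    correct = max(n_ref - errors, 0.0)
    return correct * 60.0 / duration_seconds


if __name__ == "__main__":
    model = whisper.load_model("large-v2")   # used out of the box, no fine-tuning
    passage = "The cat sat on the mat and looked up at the sun."        # placeholder passage
    result = model.transcribe("student_recording.wav", language="en")   # hypothetical file
    hypothesis = result["text"]

    print("WER:", word_error_rate(passage, hypothesis))
    print("WCPM:", wcpm(passage, hypothesis, duration_seconds=30.0))

    # Agreement with expert human graders; the paper reports r = 0.96 (placeholder scores here).
    automated = [42.0, 55.5, 61.0, 38.2]
    human = [40.0, 57.0, 60.5, 39.0]
    r, _ = pearsonr(automated, human)
    print("Pearson r:", round(r, 3))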
Related papers
- An End-to-End Approach for Child Reading Assessment in the Xhosa Language [0.3579433677269426]
This study focuses on Xhosa, a language spoken in South Africa, to advance child speech recognition capabilities.
We present a novel dataset composed of child speech samples in Xhosa.
The results indicate that the performance of these models can be significantly influenced by the amount and balancing of the available training data.
arXiv Detail & Related papers (2025-05-23T00:59:58Z) - Who Said What WSW 2.0? Enhanced Automated Analysis of Preschool Classroom Speech [24.034728707160497]
This paper introduces an automated framework, WSW2.0, for analyzing vocal interactions in preschool classrooms.
WSW2.0 achieves a weighted F1 score of .845, accuracy of .846, and an error-corrected kappa of .672 for speaker classification (child vs. teacher).
We apply the framework to an extensive dataset spanning two years and over 1,592 hours of classroom audio recordings.
arXiv Detail & Related papers (2025-05-15T05:21:34Z) - Automatic Proficiency Assessment in L2 English Learners [51.652753736780205]
Second language (L2) proficiency in English is usually evaluated perceptually by English teachers or expert evaluators.
This paper explores deep learning techniques for comprehensive L2 proficiency assessment, addressing both the speech signal and its corresponding transcription.
arXiv Detail & Related papers (2025-05-05T12:36:03Z) - Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning [9.670752318129326]
We first compare wav2vec 2.0, HuBERT and WavLM models adapted to phoneme recognition in French child speech.
We then adapt the best-performing model, WavLM base+, by unfreezing its transformer blocks during fine-tuning on child speech.
We show that WavLM base+ is more robust to various reading tasks and noise levels.
arXiv Detail & Related papers (2025-03-06T18:57:16Z) - Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z) - CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z) - Reading Miscue Detection in Primary School through Automatic Speech Recognition [10.137389745562512]
This study investigates how efficiently state-of-the-art (SOTA) pretrained ASR models recognize Dutch native children's speech.
We found that HuBERT Large fine-tuned on Dutch speech achieves SOTA phoneme-level child speech recognition.
Wav2Vec2 Large shows the highest recall at 0.83, whereas Whisper exhibits the highest precision at 0.52 and an F1 score of 0.52.
arXiv Detail & Related papers (2024-06-11T08:41:21Z) - Who Said What? An Automated Approach to Analyzing Speech in Preschool Classrooms [0.4207829324073153]
We propose an automated framework that uses software to classify speakers and to transcribe their utterances.
We compare results from our framework to those from a human expert for 110 minutes of classroom recordings.
The results suggest substantial progress in analyzing classroom speech that may support children's language development.
arXiv Detail & Related papers (2024-01-14T18:27:37Z) - HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z) - wav2vec and its current potential to Automatic Speech Recognition in
German for the usage in Digital History: A comparative assessment of
available ASR-technologies for the use in cultural heritage contexts [0.0]
We train and publish a state-of-the-art open-source model for Automatic Speech Recognition for German.
We evaluate the current potential of this technology for the use in the larger context of Digital Humanities and cultural heritage indexation.
We argue that ASR will become a key technology for the documentation and analysis of audio-visual sources.
arXiv Detail & Related papers (2023-03-06T22:24:31Z) - Proficiency assessment of L2 spoken English using wav2vec 2.0 [3.4012007729454816]
We use wav2vec 2.0 for assessing overall and individual aspects of proficiency on two small datasets.
We find that this approach significantly outperforms the BERT-based baseline system trained on ASR and manual transcriptions used for comparison.
arXiv Detail & Related papers (2022-10-24T12:36:49Z) - Nonwords Pronunciation Classification in Language Development Tests for
Preschool Children [7.224391516694955]
This work aims to automatically evaluate whether the language development of children is age-appropriate.
In this work, the task is to determine whether spoken nonwords have been uttered correctly.
We compare different approaches that are motivated to model specific language structures.
arXiv Detail & Related papers (2022-06-16T10:19:47Z) - Self-supervised models of audio effectively explain human cortical
responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z) - Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z) - Unsupervised Cross-lingual Representation Learning for Speech
Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.