Automatic Assessment of Oral Reading Accuracy for Reading Diagnostics
- URL: http://arxiv.org/abs/2306.03444v1
- Date: Tue, 6 Jun 2023 06:49:58 GMT
- Title: Automatic Assessment of Oral Reading Accuracy for Reading Diagnostics
- Authors: Bo Molenaar, Cristian Tejedor-Garcia, Helmer Strik, Catia Cucchiarini
- Abstract summary: We evaluate six state-of-the-art ASR-based systems for automatically assessing Dutch oral reading accuracy using Kaldi and Whisper.
Results show our most successful system reached substantial agreement with human evaluations.
- Score: 9.168525887419388
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic assessment of reading fluency using automatic speech recognition
(ASR) holds great potential for early detection of reading difficulties and
subsequent timely intervention. Precise assessment tools are required,
especially for languages other than English. In this study, we evaluate six
state-of-the-art ASR-based systems for automatically assessing Dutch oral
reading accuracy using Kaldi and Whisper. Results show our most successful
system reached substantial agreement with human evaluations (MCC = .63). The
same system reached the highest correlation between forced decoding confidence
scores and word correctness (r = .45). This system's language model (LM)
consisted of manual orthographic transcriptions and reading prompts of the test
data, which shows that including reading errors in the LM improves assessment
performance. We discuss the implications for developing automatic assessment
systems and identify possible avenues of future research.
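For readers unfamiliar with the reported metrics, the sketch below shows how word-level agreement with human raters can be scored with the Matthews correlation coefficient (MCC) and how confidence scores can be correlated with word correctness. This is a minimal illustration using standard library calls, not the authors' code; the arrays and confidence values are invented toy data.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import matthews_corrcoef

# Toy word-level judgments: 1 = word read correctly, 0 = reading error.
# In the paper, reference labels come from human evaluators and predictions
# from an ASR-based system; these arrays are purely illustrative.
human_labels  = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
system_labels = np.array([1, 1, 0, 1, 1, 1, 1, 0, 0, 1])

# Agreement between system and human judgments (cf. the reported MCC = .63).
mcc = matthews_corrcoef(human_labels, system_labels)

# Hypothetical forced-decoding confidence scores, one per word.
confidences = np.array([0.92, 0.88, 0.15, 0.81, 0.55, 0.97, 0.90, 0.22, 0.40, 0.85])

# Correlation between confidence and human word correctness (cf. r = .45).
r, p = pearsonr(confidences, human_labels)

print(f"MCC = {mcc:.2f}, Pearson r = {r:.2f} (p = {p:.3f})")
```

MCC is a natural choice for this task because word-level correctness is typically imbalanced (most words are read correctly), and MCC discounts chance agreement more strictly than raw accuracy does.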
Related papers
- Spoken Grammar Assessment Using LLM [10.761744330206065]
Spoken language assessment (SLA) systems evaluate a speaker's pronunciation and oral fluency from read and spontaneous spoken utterances, respectively.
Most written language assessment (WLA) systems present sentences drawn from a curated, finite-size database, making it possible to anticipate the test questions and train for them.
We propose a novel end-to-end SLA system that assesses language grammar from spoken utterances, making WLA systems redundant.
arXiv Detail & Related papers (2024-10-02T14:15:13Z)
- Zero-shot Generative Large Language Models for Systematic Review Screening Automation [55.403958106416574]
This study investigates the effectiveness of zero-shot large language models for automatic screening.
We evaluate eight different LLMs and investigate a calibration technique that uses a predefined recall threshold; an illustrative sketch of such threshold calibration is given below.
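A rough illustration of what calibration against a predefined recall threshold could look like (an assumption about the technique, not the paper's implementation; `calibrate_threshold` and the toy data are hypothetical):

```python
import numpy as np

def calibrate_threshold(scores, labels, target_recall=0.95):
    """Pick the highest score threshold whose recall on a labelled
    calibration set is still at least `target_recall`. Illustrative only."""
    positives = np.sort(scores[labels == 1])[::-1]  # positive-class scores, descending
    if len(positives) == 0:
        return 0.0
    # Keeping the top k positive scores yields recall k / n_positives.
    k = int(np.ceil(target_recall * len(positives)))
    return positives[k - 1]

# Toy calibration data: LLM-derived inclusion scores and human screening labels.
scores = np.array([0.95, 0.90, 0.80, 0.72, 0.60, 0.40, 0.30, 0.10])
labels = np.array([1,    1,    0,    1,    1,    0,    0,    0])

threshold = calibrate_threshold(scores, labels, target_recall=0.95)
keep = scores >= threshold  # studies forwarded to manual screening
print(f"threshold = {threshold:.2f}, recall = {labels[keep].sum() / labels.sum():.2f}")
```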
arXiv Detail & Related papers (2024-01-12T01:54:08Z)
- Convergences and Divergences between Automatic Assessment and Human Evaluation: Insights from Comparing ChatGPT-Generated Translation and Neural Machine Translation [1.6982207802596105]
This study investigates the convergences and divergences between automated metrics and human evaluation.
To perform automatic assessment, four automated metrics are employed, while human evaluation incorporates the DQF-MQM error typology and six rubrics.
Results underscore the indispensable role of human judgment in evaluating the performance of advanced translation tools.
arXiv Detail & Related papers (2024-01-10T14:20:33Z)
- Exploiting prompt learning with pre-trained language models for Alzheimer's Disease detection [70.86672569101536]
Early diagnosis of Alzheimer's disease (AD) is crucial for facilitating preventive care and delaying further progression.
This paper investigates the use of prompt-based fine-tuning of PLMs that consistently uses AD classification errors as the training objective function.
arXiv Detail & Related papers (2022-10-29T09:18:41Z)
- Exploring linguistic feature and model combination for speech recognition based automatic AD detection [61.91708957996086]
Speech-based automatic AD screening systems provide a non-intrusive and more scalable alternative to other clinical screening techniques.
Scarcity of specialist data leads to uncertainty in both model selection and feature learning when developing such systems.
This paper investigates the use of feature and model combination approaches to improve the robustness of domain fine-tuning of BERT and RoBERTa pre-trained text encoders.
arXiv Detail & Related papers (2022-06-28T05:09:01Z)
- Automated Evaluation of Standardized Dementia Screening Tests [0.18472148461613155]
We report on a study that consists of a semi-standardized history taking followed by two standardized neuropsychological tests.
The tests include basic tasks such as naming objects and learning word lists, as well as widely used instruments such as the MMSE.
We show that using word alternatives helps to mitigate recognition errors and thereby improves correlation with expert scores; a toy illustration of this idea follows below.
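The word-alternatives idea can be sketched as follows (the alternative sets, responses, and `score_naming` helper are invented for this illustration and are not the study's actual lexicon or code):

```python
# Accepting a small set of alternative words per target item softens the
# penalty for benign ASR recognition errors or valid synonyms.
ALTERNATIVES = {
    "couch": {"couch", "sofa", "settee"},
    "spectacles": {"spectacles", "glasses"},
    "mug": {"mug", "cup"},
}

def score_naming(asr_words, targets, alternatives=ALTERNATIVES):
    """Count a target as correct if the ASR hypothesis contains the target
    word or any accepted alternative for it."""
    hypothesis = set(asr_words)
    correct = sum(1 for t in targets if hypothesis & alternatives.get(t, {t}))
    return correct / len(targets)

# ASR recognised "sofa" where the target was "couch"; the alternative set absorbs it.
print(score_naming(["sofa", "glasses", "plate"], ["couch", "spectacles", "mug"]))
# 0.67 here, versus 0.33 with strict exact matching
```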
arXiv Detail & Related papers (2022-06-13T14:41:27Z)
- Pushing the Right Buttons: Adversarial Evaluation of Quality Estimation [25.325624543852086]
We propose a general methodology for adversarial testing of Quality Estimation for Machine Translation (MT) systems.
We show that despite a high correlation with human judgements achieved by the recent SOTA, certain types of meaning errors are still problematic for QE to detect.
We also show that, on average, a model's ability to discriminate between meaning-preserving and meaning-altering perturbations is predictive of its overall performance; a toy discrimination measure of this kind is sketched below.
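One simple way to operationalise such a discrimination measure, under the assumption that each meaning-preserving perturbation is paired with a meaning-altering one for the same translation (the function and scores are hypothetical, not the paper's protocol):

```python
def discrimination_rate(qe_preserving, qe_altering):
    """Fraction of paired perturbations for which the QE model assigns a
    higher quality score to the meaning-preserving variant. Illustrative."""
    pairs = list(zip(qe_preserving, qe_altering))
    return sum(p > a for p, a in pairs) / len(pairs)

# Toy QE scores for paired perturbations of the same translations.
preserving = [0.82, 0.76, 0.90, 0.64]
altering   = [0.70, 0.79, 0.55, 0.40]
print(discrimination_rate(preserving, altering))  # 0.75
```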
arXiv Detail & Related papers (2021-09-22T17:32:18Z)
- NUVA: A Naming Utterance Verifier for Aphasia Treatment [49.114436579008476]
Assessment of speech performance using picture naming tasks is a key method for both diagnosis and monitoring of responses to treatment interventions by people with aphasia (PWA).
Here we present NUVA, an utterance verification system incorporating a deep learning element that classifies 'correct' versus 'incorrect' naming attempts from aphasic stroke patients.
When tested on eight native British-English speaking PWA, the system's accuracy ranged from 83.6% to 93.6%, with a 10-fold cross-validation mean of 89.5%.
arXiv Detail & Related papers (2021-02-10T13:00:29Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain the de facto standard for evaluating tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.