Automated Evaluation of Standardized Dementia Screening Tests
- URL: http://arxiv.org/abs/2206.06208v1
- Date: Mon, 13 Jun 2022 14:41:27 GMT
- Title: Automated Evaluation of Standardized Dementia Screening Tests
- Authors: Franziska Braun, Markus Förstel, Bastian Oppermann, Andreas Erzigkeit, Thomas Hillemacher, Hartmut Lehfeld, Korbinian Riedhammer
- Abstract summary: We report on a study that consists of a semi-standardized history taking followed by two standardized neuropsychological tests.
The tests include basic tasks, such as naming objects and learning word lists, as well as widely used tools such as the MMSE.
We show that using word alternatives helps to mitigate recognition errors and subsequently improves correlation with expert scores.
- Score: 0.18472148461613155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For dementia screening and monitoring, standardized tests play a key role in
clinical routine since they aim at minimizing subjectivity by measuring
performance on a variety of cognitive tasks. In this paper, we report on a
study that consists of a semi-standardized history taking followed by two
standardized neuropsychological tests, namely the SKT and the CERAD-NB. The
tests include basic tasks, such as naming objects and learning word lists, as
well as widely used tools such as the MMSE. Most of the tasks are performed verbally
and should thus be suitable for automated scoring based on transcripts. For the
first batch of 30 patients, we analyze the correlation between expert manual
evaluations and automatic evaluations based on manual and automatic
transcriptions. For both SKT and CERAD-NB, we observe high to perfect
correlations using manual transcripts; for certain tasks with lower
correlation, the automatic scoring is stricter than the human reference since
it is limited to the audio. Using automatic transcriptions, correlations drop
as expected and are related to recognition accuracy; however, we still observe
high correlations of up to 0.98 (SKT) and 0.85 (CERAD-NB). We show that using
word alternatives helps to mitigate recognition errors and subsequently
improves correlation with expert scores.
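
As a rough illustration of the word-alternative idea described in the abstract, the sketch below scores a hypothetical word-list recall task from an ASR transcript, accepting a small set of alternatives per target item, and then correlates the automatic scores with expert scores. Everything here is an assumption for illustration: the `WORD_ALTERNATIVES` lists, the `score_recall` helper, and the example transcripts and expert scores are invented and do not reproduce the paper's actual scoring rules or data.

```python
# Minimal sketch (not the authors' code) of transcript-based scoring for a
# word-list recall task. Per-item word alternatives absorb common ASR
# recognition errors; the resulting automatic scores are then correlated
# with expert scores.
from scipy.stats import pearsonr

# Hypothetical target items with accepted alternatives (e.g. ASR splits,
# inflected forms). The paper's real alternative lists are not reproduced here.
WORD_ALTERNATIVES = {
    "butterfly": {"butterfly", "butter fly"},
    "arm": {"arm", "arms"},
    "ticket": {"ticket", "tickets"},
}

def score_recall(transcript: str, targets: dict[str, set[str]]) -> int:
    """Count how many target items appear in the transcript, accepting any
    listed alternative for each item (simple substring matching as a toy)."""
    text = " ".join(transcript.lower().split())
    hits = 0
    for alternatives in targets.values():
        if any(alt in text for alt in alternatives):
            hits += 1
    return hits

# Hypothetical ASR transcripts and expert scores for a few recordings.
asr_transcripts = [
    "uh the butter fly and the arm and tickets",
    "butterfly arm",
    "I do not remember",
]
expert_scores = [3, 2, 0]

auto_scores = [score_recall(t, WORD_ALTERNATIVES) for t in asr_transcripts]
r, p = pearsonr(auto_scores, expert_scores)
print(f"automatic scores: {auto_scores}, Pearson r = {r:.2f} (p = {p:.3f})")
```

In this toy setup, accepting "butter fly" and plural forms recovers items that an exact-match scorer would miss, which is the kind of effect the abstract attributes to word alternatives when scoring from automatic transcriptions.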
Related papers
- Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge [51.93909886542317]
We show how *relying on a single aggregate correlation score* can obscure fundamental differences between human behavior and automatic evaluation methods.
We propose stratifying results by human label uncertainty to provide a more robust analysis of automatic evaluation performance.
arXiv Detail & Related papers (2024-10-03T03:08:29Z)
- Automatic Assessment of Oral Reading Accuracy for Reading Diagnostics [9.168525887419388]
We evaluate six state-of-the-art ASR-based systems for automatically assessing Dutch oral reading accuracy using Kaldi and Whisper.
Results show our most successful system reached substantial agreement with human evaluations.
arXiv Detail & Related papers (2023-06-06T06:49:58Z)
- An Investigation of Evaluation Metrics for Automated Medical Note Generation [5.094623170336122]
We study evaluation methods and metrics for the automatic generation of clinical notes from medical conversations.
To study the correlation between the automatic metrics and manual judgments, we evaluate automatic notes/summaries by comparing the system and reference facts.
arXiv Detail & Related papers (2023-05-27T04:34:58Z)
- Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation [58.54483567073125]
We propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists.
We observed good levels of inter-annotator agreement in a first evaluation study using the protocol.
arXiv Detail & Related papers (2022-11-17T10:54:28Z)
- Exploiting prompt learning with pre-trained language models for Alzheimer's Disease detection [70.86672569101536]
Early diagnosis of Alzheimer's disease (AD) is crucial for facilitating preventive care and delaying further progression.
This paper investigates prompt-based fine-tuning of PLMs that consistently uses AD classification errors as the training objective function.
arXiv Detail & Related papers (2022-10-29T09:18:41Z)
- TRScore: A Novel GPT-based Readability Scorer for ASR Segmentation and Punctuation model evaluation and selection [1.4720080476520687]
Punctuation and segmentation are key to readability in Automatic Speech Recognition output.
Human evaluation is expensive, time-consuming, and suffers from large inter-observer variability.
We present TRScore, a novel readability measure that uses the GPT model to evaluate different segmentation and punctuation systems.
arXiv Detail & Related papers (2022-10-27T01:11:32Z)
- Going Beyond the Cookie Theft Picture Test: Detecting Cognitive Impairments using Acoustic Features [0.18472148461613155]
We show that acoustic features from standardized tests can be used to reliably discriminate cognitively impaired individuals from non-impaired ones.
We provide evidence that even features extracted from random speech samples of the interview can be a discriminator of cognitive impairment.
arXiv Detail & Related papers (2022-06-10T12:04:22Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BertScore (a sketch of such a character-level metric follows after this list).
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
- NUVA: A Naming Utterance Verifier for Aphasia Treatment [49.114436579008476]
Assessment of speech performance using picture naming tasks is a key method for both diagnosis and monitoring of responses to treatment interventions by people with aphasia (PWA).
Here we present NUVA, an utterance verification system incorporating a deep learning element that classifies 'correct' versus 'incorrect' naming attempts from aphasic stroke patients.
When tested on eight native British-English-speaking PWA, the system's performance accuracy ranged from 83.6% to 93.6%, with a 10-fold cross-validation mean of 89.5%.
arXiv Detail & Related papers (2021-02-10T13:00:29Z)
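
As referenced in the consultation-note entry above, a character-based Levenshtein metric is easy to state exactly. The sketch below, assuming only the description in that entry, normalizes character-level edit distance into a 0..1 similarity between a generated note and a reference note; the `char_similarity` helper and the example notes are hypothetical and this is not the cited paper's implementation.

```python
# Minimal sketch of a character-based Levenshtein similarity between a
# generated consultation note and a clinician-written reference note.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance at the character level."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def char_similarity(hypothesis: str, reference: str) -> float:
    """Normalize the edit distance into a 0..1 similarity score."""
    if not hypothesis and not reference:
        return 1.0
    distance = levenshtein(hypothesis, reference)
    return 1.0 - distance / max(len(hypothesis), len(reference))

# Hypothetical generated note vs. reference note.
generated = "Patient reports mild memory problems for six months."
reference = "Patient reports mild memory complaints over the last six months."
print(f"character-level similarity: {char_similarity(generated, reference):.2f}")
```

A metric like this needs no vocabulary or trained model, which may be part of why it can compete with model-based metrics on short clinical notes, as the entry above reports.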